GPU Activity Prediction using Representation Learning by Raghavan, Aswin et al.
GPU Activity Prediction using Representation Learning
Aswin Raghavan, Mohamed Amer, Timothy Shields, David Zhang, Sek Chai FIRSTNAME.LASTNAME@SRI.COM
SRI International, 201 Washington Rd., Princeton, NJ 08540
Abstract
GPU activity prediction is an important and complex problem
because of the high level of contention among thousands of
parallel threads. This problem has mostly been addressed using
heuristics. We propose a representation learning approach to
address this problem. We model any
performance metric as a temporal function of the
executed instructions with the intuition that the
flow of instructions can be identified as distinct
activities of the code. Our experiments show
high accuracy and non-trivial predictive power of
representation learning on a benchmark.
1. Introduction
The performance of a computing system relies on the sus-
tained operational throughput. Sustained operation is be-
coming harder to achieve as computation workloads be-
come more complex. At the same time, with the end of
Dennard scaling (Esmaeilzadeh et al., 2011) and the
increasing abundance of Big Data, it is imperative to mini-
mize wasted processor effort in order to achieve processor
reliability and scalability (Wulf & McKee, 1995; Patterson,
2006; Bergman et al., 2008).
The goal of this paper is to demonstrate the efficacy of machine
learning for designing computing systems. We anticipate
that classical problems such as branch prediction and
cache management can be re-evaluated such that heuris-
tically based approaches (Yeh & Patt, 1991; Fung et al.,
2007; Li et al., 2015) can be replaced with a machine learn-
ing approach (Leng et al., 2015). It is well understood
in the computer architecture community that processor be-
havior is highly complex and data dependent. Processor
data is widely available in the form of benchmarks (Che
et al., 2010), and algorithms are extensively compared us-
ing these benchmarks (Blem et al., 2011).
Proceedings of the 33rd International Conference on Machine
Learning, New York, NY, USA, 2016. JMLR: W&CP volume
48. Copyright 2016 by the author(s).
We choose the well-understood and well-defined problem
of predicting GPU cache misses (Li et al., 2015; Chen
et al., 2014) for this paper. General-purpose GPUs (GPGPUs)
achieve high-throughput execution via a high level of
parallelism. Predicting GPU cache misses is complex due to
the high level of contention among thousands of threads.
Cache contention is a bottleneck for parallel execution
when many threads are waiting for cache operation, caus-
ing the addition of more threads (or cores) to be detrimen-
tal. Predicting whether a cache miss is about to occur is
useful for better cache management such as cache bypass-
ing (Chen et al., 2014), pre-fetching (Lee et al., 2010),
prioritized allocation (Li et al., 2015) etc. Further, cache
misses indirectly cause increased energy and power usage
(Leng et al., 2015) because of second order effects beyond
memory latency. In principle, our approach is amenable
to predicting these higher-order events (such as voltage
scaling (Leng et al., 2015) and faults (S. Chai, 2014)) either
directly (Tiwari et al., 1994) or via hierarchical modeling.
We propose a new model that predicts key processor
events that limit processor throughput. The model is a new
variant of the Conditional Restricted Boltzmann Machine
(CRBM) (Taylor et al., 2011) that directly addresses system
performance and reliability. CRBMs efficiently model
short-term temporal phenomena. Prior work used a
perceptron to predict cache misses (Leng et al., 2013).
Unlike that approach, our model accounts for time-series and
count data.
Our approach assumes the availability of a simulator for
CUDA (Bakhoda et al., 2009) that can generate a dataset
for training our model. In principle, this approach can be
used in real-time by incrementally augmenting the dataset.
Multiple repeated executions can even lead to increased
predictive power because more data is available for ma-
chine learning. Our predictor is naturally agnostic to the
hardware and architecture as it relies on execution traces.
Our contributions:
• Prediction of processor events as temporally-extended
activities in a stream of instructions.
• Using Discriminative Conditional Restricted Boltzmann
Machines (DCRBM) to learn processor states.
arXiv:1703.09146v1 [cs.LG] 27 Mar 2017
2. Literature Review
Cache Miss Prediction: There is a large body of research
on branch prediction to improve cache performance.
Simple static solutions can achieve 80% correct prediction by
analyzing control flow with static heuristics (Ball & Larus,
1993). Dynamic solutions (Yeh & Patt, 1991) are more
complicated, as they are implemented with counters and
tables that store branch history based on branch memory
address. Other, data-driven approaches use perceptrons
(Jiménez & Lin, 2001) and feed-forward neural networks
(Calder et al., 1997).
Representation Learning: Restricted Boltzmann Ma-
chines (RBMs) form the building blocks in energy based
deep networks (Hinton et al., 2006; Salakhutdinov & Hin-
ton, 2006). Recently, temporal models based on deep net-
works have been proposed, capable of modeling a more
temporally rich set of problems. These include Conditional
RBMs (CRBMs) (Taylor et al., 2011) and Temporal RBMs
(TRBMs) (Sutskever & Hinton, 2007). CRBMs have been
used in both the visual (Taylor et al., 2011) and audio
(Mohamed & Hinton, 2009) domains. In addition to efficiently
modeling time-series data, RBMs have been formulated to be
trained discriminatively for classification (Larochelle & Bengio,
2008) and to model word-count vectors from a large set of
documents (Salakhutdinov & Hinton, 2009).
3. Model
The input to our model (the visible units) is an instruction
mix per time step, i.e., the histogram of counts of
instructions being executed, obtained from the GPU simulator.
The labels are any chosen performance metric also output
by the simulator.
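As an illustration of this input representation, the following is a minimal sketch of turning a per-step instruction trace into count vectors. The trace layout, the opcode names, and the function name `instruction_mix` are assumptions made for the example; they are not the simulator's actual output format.

```python
from collections import Counter

def instruction_mix(trace, opcodes):
    """Return one count vector per time step.

    trace:   list of time steps, each a list of opcode strings issued in that step
    opcodes: fixed ordering of opcode names that defines the vector layout
    """
    index = {op: i for i, op in enumerate(opcodes)}
    mixes = []
    for issued in trace:
        row = [0] * len(opcodes)
        for op, n in Counter(issued).items():
            row[index[op]] = n  # an opcode outside the vocabulary raises KeyError
        mixes.append(row)
    return mixes

# Hypothetical three-step trace over a toy opcode vocabulary.
opcodes = ["ld", "st", "add", "mul"]
trace = [["ld", "ld", "add"], [], ["mul", "st", "mul", "mul"]]
mix = instruction_mix(trace, opcodes)
# mix == [[2, 0, 1, 0], [0, 0, 0, 0], [0, 1, 0, 3]]
```

Each row plays the role of one visible vector; a step with no issued instructions is an all-zero histogram.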
We discuss a sequence of models, gradually increasing in
complexity, so that the different components of our model
can be understood in isolation. We start with the ba-
sic CRBM model, then we extend to the discriminative
DCRBM, and finally CountDCRBM.
Figure 1. This figure illustrates the Count-DCRBM model.
Conditional Restricted Boltzmann Machines: CRBMs
(Taylor et al., 2011) are a natural extension of RBMs for
modeling short-term temporal dependencies. A CRBM is
an RBM that, at time $t$, takes into account the history from the
previous time instances $t-N, \ldots, t-1$. This is done
by treating the previous time instances as additional inputs,
which does not complicate inference. Here $v$ is a vector of
visible nodes, $h$ is a vector of hidden nodes, and $v_{<t}$ denotes the
visible vectors from the previous $N$ time instances, which
influence the current visible and hidden vectors. $E_C$ is
the energy function, and $Z$ is the partition function. The
parameters $\theta$ to be learned are the biases $a$ and $b$ for $v$ and
$h$ respectively, the weights $W$, and the matrices $A$ and $B$,
which weight the concatenated visible vectors of the previous time
instances to form the dynamic biases for $v$ and $h$, as in (5).
The CRBM is fully connected between layers, with no
lateral connections. This architecture implies that the elements of
$v$ and of $h$ are conditionally independent (factorial) given the
other vector, which allows for the exact computation of
$p_C(v_t \mid h_t, v_{<t})$ and $p_C(h_t \mid v_t, v_{<t})$.
Some approximations are made to facilitate efficient
training and inference; more details are available in (Taylor
et al., 2011). A CRBM defines a probability distribution $p_C$
as a Gibbs distribution (1).
\[
p_C(v_t, h_t \mid v_{<t}) = \frac{\exp[-E_C(v_t, h_t \mid v_{<t})]}{Z(\theta)} \tag{1}
\]
The energy function EC(vt,ht|v<t) in (2) is defined in a
manner similar to that of the RBM.
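\[
\begin{aligned}
E_{C\text{-Real}}(v_t, h_t \mid v_{<t}) &= \sum_i \frac{(c_i - v_{i,t})^2}{2} - \sum_j d_j h_{j,t} - \sum_{i,j} v_{i,t} w_{i,j} h_{j,t}, \\
E_{C\text{-Binary}}(v_t, h_t \mid v_{<t}) &= -\sum_i c_i v_{i,t} - \sum_j d_j h_{j,t} - \sum_{i,j} v_{i,t} w_{i,j} h_{j,t}, \\
E_{C\text{-Count}}(v_t, h_t \mid v_{<t}) &= -\sum_i \big(c_i v_{i,t} - \log(v_{i,t}!)\big) - \sum_j d_j h_{j,t} - \sum_{i,j} v_{i,t} w_{i,j} h_{j,t}.
\end{aligned}
\tag{2}
\]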
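The probability distributions for the visible nodes are
defined in (3),
\[
\begin{aligned}
p_{C\text{-Real}}(v_{i,t} \mid h_t, v_{<t}) &= \mathcal{N}\Big(c_i + \sum_j h_{j,t} w_{i,j},\; 1\Big), \\
p_{C\text{-Binary}}(v_{i,t} = 1 \mid h_t, v_{<t}) &= \sigma\Big(c_i + \sum_j h_{j,t} w_{i,j}\Big), \\
p_{C\text{-Count}}(v_{i,t} = m \mid h_t, v_{<t}) &= \mathcal{P}\Big(m,\; \exp\Big(c_i + \sum_j h_{j,t} w_{i,j}\Big)\Big),
\end{aligned}
\tag{3}
\]
where $\mathcal{N}$ is a normal distribution, $\sigma$ is the logistic sigmoid
function, and $\mathcal{P}(m, \lambda)$ is the Poisson probability of a count $m$
under rate $\lambda$. The distribution of the hidden nodes is
defined in (4), and the dynamic biases $c_i$ and $d_j$ in (5),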
\[
p_C(h_{j,t} = 1 \mid v_t, v_{<t}) = \sigma\Big(d_j + \sum_i v_{i,t} w_{i,j}\Big) \tag{4}
\]
\[
c_i = a_i + \sum_p A_{p,i} v_{p,<t}, \qquad d_j = b_j + \sum_p B_{p,j} v_{p,<t} \tag{5}
\]
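The conditionals in (3), (4) and (5) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: it assumes the history $v_{<t}$ is flattened into a single list `v_hist` indexed by $p$, and all parameter values below are toy numbers chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_biases(a, b, A, B, v_hist):
    """Eq. (5): c_i = a_i + sum_p A[p][i] v_hist[p], d_j = b_j + sum_p B[p][j] v_hist[p]."""
    P = len(v_hist)
    c = [a[i] + sum(A[p][i] * v_hist[p] for p in range(P)) for i in range(len(a))]
    d = [b[j] + sum(B[p][j] * v_hist[p] for p in range(P)) for j in range(len(b))]
    return c, d

def hidden_probs(v, d, W):
    """Eq. (4): p(h_j = 1 | v_t, v_<t) = sigmoid(d_j + sum_i v_i W[i][j])."""
    return [sigmoid(d[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(d))]

def poisson_rates(h, c, W):
    """Count case of Eq. (3): rate_i = exp(c_i + sum_j h_j W[i][j])."""
    return [math.exp(c[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(c))]

# Toy setup: 2 visible units, 2 hidden units, history of one step (flattened).
a, b = [0.0, 0.0], [0.0, 0.0]
A = [[0.1, 0.0], [0.0, 0.1]]   # A[p][i]: history entry p -> visible bias i
B = [[0.0, 0.0], [0.0, 0.0]]   # B[p][j]: history entry p -> hidden bias j
W = [[0.0, 0.0], [0.0, 0.0]]   # W[i][j]: visible i <-> hidden j
v_hist, v_now = [1, 2], [3, 0]

c, d = dynamic_biases(a, b, A, B, v_hist)
p_h = hidden_probs(v_now, d, W)
rates = poisson_rates([1, 0], c, W)
```

With zero weights the hidden units are at probability 0.5 and the Poisson rates depend only on the history through the dynamic bias $c$.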
Discriminative CRBMs: DCRBMs are based on the
model in (Larochelle & Bengio, 2008), generalized to
account for temporal phenomena using CRBMs. DCRBMs
are a simpler version of the Factored Conditional
Restricted Boltzmann Machines (Taylor et al., 2011) and
Gated Restricted Boltzmann Machines (Memisevic & Hinton,
2007). Both of these models incorporate labels in learning
representations; however, they use a more complicated
potential that involves three-way connections decomposed into
factors. DCRBMs define the probability distribution $p_{DC}$ as a
Gibbs distribution (6).
\[
p_{DC}(y_t, v_t, h_t \mid v_{<t}; \theta) = \frac{1}{Z(\theta)} \exp[-E_{DC}(y_t, v_t, h_t \mid v_{<t})] \tag{6}
\]
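The hidden layer $h$ is defined as a function of the labels $y$
and the visible nodes $v$. The conditional distribution of the
hidden nodes given the label and the visible nodes is given in (7),
and the classifier distribution of the label given the hidden
nodes in (8). The new energy function $E_{DC}$ is shown in (9).
\[
p_{DC}(h_{j,t} = 1 \mid y_t, v_t, v_{<t}) = \sigma\Big(d_j + u_{j,k} + \sum_i v_{i,t} w_{i,j}\Big) \tag{7}
\]
where $k$ is the index of the active label in $y_t$, i.e., $y_{k,t} = 1$.
\[
p_{DC}(y_{l,t} = 1 \mid h_t) = \frac{\exp\big[s_l + \sum_j u_{j,l} h_{j,t}\big]}{\sum_{l^*} \exp\big[s_{l^*} + \sum_j u_{j,l^*} h_{j,t}\big]} \tag{8}
\]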
\[
E_{DC}(y_t, v_t, h_t \mid v_{<t}) = \underbrace{E_C(v_t, h_t \mid v_{<t})}_{\text{Generative}} \; \underbrace{- \sum_{j,l} h_{j,t} u_{j,l} y_{l,t} - \sum_l s_l y_{l,t}}_{\text{Discriminative}} \tag{9}
\]
Count-DCRBMs: We extend the DCRBM to the Count-DCRBM
(Figure 1). Count-DCRBMs are based on the model
in (Salakhutdinov & Hinton, 2009), generalized to account
for temporal phenomena using CRBMs and for discriminative
classification. Count-DCRBMs are used to model
time-varying histograms of counts. The probability distribution
over the visible layer follows the constrained Poisson
distribution $p_{C\text{-Count}}(v_{i,t} \mid h_t, v_{<t})$ defined in (3), the
hidden layer follows (7), the label layer follows (8), and the
energy function is (9) with $E_C$ set to
$E_{C\text{-Count}}(v_t, h_t \mid v_{<t})$ from (2).
4. Inference and Learning
Inference: To perform classification at time $t$ in the
Count-DCRBM given $v_{<t}$ and $v_t$, we use a bottom-up approach,
computing a cost for each possible label $y_t$ and then choosing
the label with the least cost. The cost of a candidate label
$y_t$ is its free energy in the Count-DCRBM, which equals
$-\log p_{DC}(y_t, v_t \mid v_{<t})$ up to an additive constant shared by
all labels and is obtained by marginalizing over the hidden
nodes $h_t$. This computation is tractable because the sum over
the exponentially many hidden configurations can be
algebraically eliminated.
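To make the algebraic elimination concrete: summing $\exp(-E_{DC})$ over each binary hidden unit gives a factor $1 + \exp(d_j + u_{j,k} + \sum_i v_{i,t} w_{i,j})$, so the label-dependent part of the free energy for class $k$ is $-s_k - \sum_j \log(1 + \exp(\cdot))$. The sketch below, with toy parameter values and function names that are assumptions of this example, computes that cost and checks it against an explicit sum over all hidden configurations.

```python
import math
from itertools import product

def softplus(x):
    # log(1 + exp(x)), written to avoid overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def label_free_energy(k, v, d, W, U, s):
    """Label-dependent part of the free energy for class k.

    The visible-only term of the energy is identical for every label,
    so it is dropped here; it does not change which label is cheapest.
    """
    total = -s[k]
    for j in range(len(d)):
        act = d[j] + U[j][k] + sum(v[i] * W[i][j] for i in range(len(v)))
        total -= softplus(act)
    return total

def classify(v, d, W, U, s):
    """Return (index of the least-cost label, list of costs per label)."""
    costs = [label_free_energy(k, v, d, W, U, s) for k in range(len(s))]
    return min(range(len(s)), key=costs.__getitem__), costs

def brute_force_free_energy(k, v, d, W, U, s):
    """Same quantity via an explicit sum over all 2^H hidden configurations."""
    z = 0.0
    for h in product([0, 1], repeat=len(d)):
        neg_energy = s[k]
        for j, hj in enumerate(h):
            neg_energy += hj * (d[j] + U[j][k]
                                + sum(v[i] * W[i][j] for i in range(len(v))))
        z += math.exp(neg_energy)
    return -math.log(z)

# Toy parameters: 3 visible counts, 2 hidden units, 2 labels (miss / no miss).
v = [3, 0, 1]
d = [0.1, -0.2]                               # dynamic hidden biases from Eq. (5)
W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5]]    # W[i][j]
U = [[0.3, -0.3], [-0.2, 0.6]]                # U[j][l]
s = [0.0, -0.5]                               # label biases

predicted, costs = classify(v, d, W, U, s)
```

The closed form costs a number of operations linear in the number of hidden units per label, whereas the brute-force sum grows as $2^H$.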
Learning: The parameters of our model can be learned using
Contrastive Divergence (CD) (Hinton, 2002), where
$\langle\cdot\rangle_{data}$ is the expectation with respect to the data
distribution and $\langle\cdot\rangle_{recon}$ is the expectation with respect to the
reconstructed data. Learning is done in two steps,
a bottom-up pass and a top-down pass, using the sampling
equations (3), (7), and (8). Bottom-up: the reconstruction
is generated by first sampling the hidden layer
$p(h_{j,t} = 1 \mid v_t, v_{<t}, y_t)$ for all the hidden nodes in
parallel. Top-down: this is followed by sampling the visible
nodes $p(v_{i,t} \mid h_t, v_{<t})$ and the labels $p(y_{l,t} \mid h_t)$ for all the
visible and label nodes in parallel.
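The two passes can be sketched as a single CD-1 step. This is a hedged illustration rather than the authors' training code: it assumes the dynamic biases `c` and `d` have already been computed from the history, it samples counts from a plain (unconstrained) Poisson using Knuth's method, and it returns only the gradient estimates for $W$, $U$ and $s$. All names and toy values are assumptions of the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_poisson(rate, rng):
    # Knuth's multiplication method; adequate for the small rates of a toy example.
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def hidden_probs(v, k, d, W, U):
    """Eq. (7) for a label whose active class is k."""
    return [sigmoid(d[j] + U[j][k] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(d))]

def label_probs(h, s, U):
    """Eq. (8): softmax over labels, shifted by the max logit for stability."""
    logits = [s[l] + sum(U[j][l] * h[j] for j in range(len(h))) for l in range(len(s))]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cd1_gradients(v0, k0, c, d, W, U, s, rng):
    """One CD-1 pass; returns <.>_data - <.>_recon estimates for W, U and s."""
    n_vis, n_hid, n_lab = len(v0), len(d), len(s)
    # Bottom-up: sample hidden units given the data and its label.
    ph0 = hidden_probs(v0, k0, d, W, U)
    h0 = [1 if rng.random() < p else 0 for p in ph0]
    # Top-down: sample counts from the count case of Eq. (3) and a label from Eq. (8).
    v1 = [sample_poisson(math.exp(c[i] + sum(h0[j] * W[i][j] for j in range(n_hid))), rng)
          for i in range(n_vis)]
    k1 = rng.choices(range(n_lab), weights=label_probs(h0, s, U))[0]
    ph1 = hidden_probs(v1, k1, d, W, U)
    dW = [[v0[i] * ph0[j] - v1[i] * ph1[j] for j in range(n_hid)] for i in range(n_vis)]
    dU = [[ph0[j] * (l == k0) - ph1[j] * (l == k1) for l in range(n_lab)]
          for j in range(n_hid)]
    ds = [(l == k0) - (l == k1) for l in range(n_lab)]
    return dW, dU, ds

# Toy call: 3 visible counts, 2 hidden units, 2 labels, fixed seed.
rng = random.Random(0)
W = [[0.1, -0.1], [0.0, 0.2], [0.3, 0.0]]
U = [[0.5, -0.5], [0.0, 0.1]]
dW, dU, ds = cd1_gradients([2, 0, 1], 1, [0.0, 0.0, 0.0], [0.0, 0.0], W, U, [0.0, 0.0], rng)
```

A parameter update would then add a small multiple of each estimate to the corresponding parameter, for example `W[i][j] += lr * dW[i][j]`, where `lr` is a learning rate chosen by the user.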
5. Experiments
We used the open-source simulator GPGPU-Sim (Bakhoda
et al., 2009) to generate data to validate our approach. The
simulator has been rigorously verified for accuracy
on a suite of 80 microbenchmarks (Leng et al., 2013). We
used the Backprop problem from the Rodinia benchmark
(Che et al., 2010) and simulated an NVIDIA GTX480
GPU with the default configuration for GPGPU-Sim. This
benchmark CUDA program trains a feedforward neural
network with one hidden layer consisting of 4096 units.
To generate our dataset, we modified GPGPU-Sim to
retrieve the time-indexed list of instruction mixes, i.e., for each
time cycle, the number of instructions of each type, based
on opcode. These are the visible units in our model. We
tested our approach on three different caches (Instruction
(IC), Data Read (D R), Data Write (D W)) localized within
one core of the GPU. For each cache, GPGPU-Sim
outputs a list of time-indexed binary labels corresponding to
whether a cache miss occurred. Since we want to predict
a cache miss ahead of time, we aggregated the labels over
128 cycles, so a label of $y(t) = 1$ means that a cache miss
occurred in cycles $[t, t+128]$.
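The label aggregation can be sketched as a sliding-window "any miss" test. This is a minimal sketch: the function name is hypothetical, the interval is taken as closed at both ends as written above, and windows that run past the end of the trace are simply clipped, which is an assumption of this example.

```python
def aggregate_labels(miss, window=128):
    """y[t] = 1 if any miss occurs in cycles t .. t+window (inclusive).

    miss: list of per-cycle binary flags (1 = cache miss in that cycle).
    Windows that extend past the end of the trace are clipped.
    """
    n = len(miss)
    # prefix[i] = number of misses in miss[:i], so any window is an O(1) lookup
    prefix = [0] * (n + 1)
    for i, m in enumerate(miss):
        prefix[i + 1] = prefix[i] + (1 if m else 0)
    labels = []
    for t in range(n):
        end = min(n, t + window + 1)
        labels.append(1 if prefix[end] - prefix[t] > 0 else 0)
    return labels

# Toy trace with a single miss at cycle 3 and a short window of 2 cycles.
miss = [0, 0, 0, 1, 0, 0, 0, 0]
labels = aggregate_labels(miss, window=2)
# labels == [0, 1, 1, 1, 0, 0, 0, 0]
```

The label turns on `window` cycles before the miss, which is what lets the model be trained to predict the miss ahead of time.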
The Count-DCRBM was trained on a Tesla K20C GPU using
Contrastive Divergence with a constant learning rate of
$10^{-5}$. Table 1 shows the final scores of a model with
15 hidden nodes and varying temporal history available to the
DCRBM. The MCC and F1 columns are metrics that
describe predictive power, taking into account false positives
and false negatives. We observe high accuracy and predictive
power of the model for all three caches. We also observe
that increased history generally leads to better performance
despite the increased model complexity. Our baseline
is an SVM that uses the raw instruction mix as features
without any temporal history.
Model accuracy can be misleading because cache miss
events are rare (e.g., about 10% for IC). Figure 2 (Top)
shows these metrics over training epochs for the data write
cache. Note that the initial model accuracy is already about
70%, where the model predicts that a cache miss never
occurs, with corresponding F1 and Matthews Correlation
Coefficient (MCC) values of zero. As training epochs
increase, we note a sharp increase in predictive power
around 5000 epochs. We also show the reconstruction error,
the objective value for training, over epochs for the data
write cache in Figure 2 (Middle). We observe that the
reconstruction error drops significantly in the first 20k epochs.
Figure 2 (Bottom) shows a measure of the classification
error, measured in terms of the binary cross entropy between
the true and predicted labels. We also observe that the
classification error continues to drop steadily even though the
reconstruction error has converged, showing that the model
accounts for label information. Figure 3 shows the
prediction using a history of 10 cycles, in comparison with the
ground truth. Future work includes validating our approach
across microbenchmarks.
Cache  Model       MCC   F1    Accuracy
DC R   DCRBM(1)    0.32  0.59  0.67
DC R   DCRBM(5)    0.35  0.63  0.68
DC R   DCRBM(10)   0.39  0.66  0.69
DC R   SVM(Poly)   0.34  0.44  0.83
DC R   SVM(RBF)    0.41  0.51  0.83
DC W   DCRBM(1)    0.36  0.58  0.70
DC W   DCRBM(5)    0.36  0.59  0.65
DC W   DCRBM(10)   0.32  0.58  0.65
DC W   SVM(Poly)   0.27  0.31  0.64
DC W   SVM(RBF)    0.32  0.55  0.68
IC     DCRBM(1)    0.32  0.40  0.86
IC     DCRBM(5)    0.32  0.39  0.87
IC     DCRBM(10)   0.37  0.44  0.85
IC     SVM(Poly)   0     0     0.99
IC     SVM(RBF)    0     0     0.98
Table 1. Scores vs Models for different types of cache. The larger
the history for DCRBM, the higher the complexity and training
difficulty of the model. Larger history is better except in the case
of Data Write.
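The remark that accuracy can mislead when misses are rare can be checked numerically. The sketch below computes accuracy, F1 and MCC from a confusion matrix for a predictor that never predicts a miss; the 30% miss rate is a toy value chosen to mirror the roughly 70% initial accuracy described in the text, and returning zero when a denominator vanishes is a common convention assumed here.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def scores(y_true, y_pred):
    """Return (accuracy, F1, MCC); F1 and MCC are 0.0 when undefined."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return accuracy, f1, mcc

# Toy data: 3 misses in 10 windows, and a predictor that never predicts a miss.
y_true = [1] * 3 + [0] * 7
never_miss = [0] * 10
acc, f1, mcc = scores(y_true, never_miss)
# acc == 0.7, f1 == 0.0, mcc == 0.0
```

A trivial predictor therefore scores well on accuracy while carrying no predictive power, which is why F1 and MCC are reported alongside accuracy.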
6. Conclusions
Our approach has significant implications for the GPU
revolution of computing. A data-driven approach can
potentially identify mixes of instructions that cause performance
bottlenecks. Although we focused on cache misses, any
statistic of interest to the computer architecture community
such as power consumption and voltage can potentially be
predicted. The extension to an online embedded setting
is straightforward and could potentially save computation
time. Prediction of performance bottlenecks is a step to-
wards a cognitive processor architecture.
Acknowledgments
This research is partially funded under NSF #1526399, the
Defense Advanced Research Projects Agency (DARPA)
and the Air Force Research Laboratory (AFRL). The views,
opinions and/or findings expressed are those of the authors
and should not be interpreted as representing the official
views or policies of the Department of Defense or the U.S.
Government.
Figure 2. (Top) Accuracy in dark blue, precision in
light blue, recall in red, MCC in yellow, F1 in grey.
(Middle) Reconstruction error. (Bottom) Classification error.
Figure 3. (Top) Ground truth labels, (Bottom) Prediction
References
Bakhoda, Ali, Yuan, George L, Fung, Wilson WL, Wong,
Henry, and Aamodt, Tor M. Analyzing Cuda Work-
loads Using a Detailed GPU Simulator. In IEEE Interna-
tional Symposium on Performance Analysis of Systems
and Software (ISPASS), 2009.
Ball, Thomas and Larus, James R. Branch Prediction for
Free. In ACM, 1993.
Bergman, Keren et al. ExaScale Computing Study: Tech-
nology Challenges in Achieving Exascale Systems. vol-
ume 15, 2008.
Blem, Emily, Sinclair, Matthew, and Sankaralingam,
Karthikeyan. Challenge Benchmarks that must be Con-
quered to Sustain the GPU Revolution. CELL, 2011.
Calder, Brad et al. Evidence-based Static Branch
Prediction using Machine Learning. In ACM Transactions
on Programming Languages and Systems (TOPLAS),
1997.
Che, Shuai, Sheaffer, Jeremy W, Boyer, Michael, Szafaryn,
Lukasz G, Wang, Liang, and Skadron, Kevin. A Char-
acterization of the Rodinia Benchmark Suite with Com-
parison to Contemporary CMP Workloads. In IISWC,
2010.
Chen, Xuhao, Chang, Li-Wen, Rodrigues, Christopher I,
Lv, Jie, Wang, Zhiying, and Hwu, Wen-Mei. Adaptive
Cache Management for Energy-efficient GPU Comput-
ing. In Microarchitecture, 2014.
Esmaeilzadeh, Hadi et al. Dark Silicon and the End of
Multicore Scaling. In Proceedings of the International
Symposium on Computer Architecture (ISCA), 2011.
Fung, Wilson WL, Sham, Ivan, Yuan, George, and Aamodt,
Tor M. Dynamic Warp Formation and Scheduling for
Efficient GPU Control Flow. In Proceedings of the
40th Annual IEEE/ACM International Symposium on
Microarchitecture, 2007.
Hinton, Geoffrey E. Training Products of Experts by Min-
imizing Contrastive Divergence. MIT Press, 2002.
Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye.
A Fast Learning Algorithm for Deep Belief Nets. MIT
Press, 2006.
Jiménez, Daniel A and Lin, Calvin. Dynamic Branch
Prediction with Perceptrons. In High-Performance
Computer Architecture, 2001. HPCA. The Seventh
International Symposium on. IEEE, 2001.
Larochelle, H. and Bengio, Y. Classification using Dis-
criminative Restricted Boltzmann Machines. In ICML,
2008.
Lee, Jaekyu, Lakshminarayana, Nagesh B, Kim, Hyesoon,
and Vuduc, Richard. Many-thread Aware Prefetching
Mechanisms for GPGPU Applications. In Microarchi-
tecture, 2010.
Leng, Jingwen, Hetherington, Tayler, El Tantawy, Ahmed,
Gilani, Syed, Kim, Nam Sung, Aamodt, Tor M, and
Reddi, Vijay Janapa. GPUWattch: Enabling Energy
Optimizations in GPGPUs. ACM SIGARCH, 2013.
Leng, Jingwen, Buyuktosunoglu, Alper, Bertran, Ramon,
Bose, Pradip, and Reddi, Vijay Janapa. Safe Limits on
Voltage Reduction Efficiency in GPUs: A Direct Mea-
surement Approach. In Microarchitecture. ACM, 2015.
Li, Dong, Rhu, Minsoo, Johnson, Daniel R, O’Connor,
Mike, Erez, Mattan, Burger, Doug, Fussell, Donald S,
and Keckler, Stephen W. Priority-based Cache
Allocation in Throughput Processors. In HPCA, 2015.
Memisevic, R. and Hinton, G. E. Unsupervised Learning
of Image Transformations. In CVPR, 2007.
Mohamed, A. R. and Hinton, G. E. Phone Recognition
using Restricted Boltzmann Machines. In ICASSP, 2009.
Patterson, David. Future of Computer Architecture. In
Berkeley EECS Annual Research Symposium, 2006.
S. Chai, et al. Lightweight Detection and Recovery Mech-
anisms to Extend Algorithm Resiliency in Noisy Com-
putation. In WNTC, 2014.
Salakhutdinov, R. and Hinton, G. E. Reducing the Di-
mensionality of Data with Neural Networks. In Science,
2006.
Salakhutdinov, Ruslan and Hinton, Geoffrey. Semantic
Hashing. In International Journal of Approximate Rea-
soning, 2009.
Sutskever, I. and Hinton, G. E. Learning Multilevel
Distributed Representations for High-Dimensional Se-
quences. In AISTATS, 2007.
Taylor, Graham W., Hinton, Geoffrey E., and Roweis,
Sam T. Two Distributed-State Models For Generating
High-Dimensional Time Series. In Journal of Machine
Learning Research, 2011.
Tiwari, Vivek, Malik, Sharad, and Wolfe, Andrew. Power
Analysis of Embedded Software: A First Step Towards
Software Power Minimization. VLSI Systems, 1994.
Wulf, Wm A. and McKee, Sally A. Hitting the Memory
Wall: Implications of the Obvious. In ACM SIGARCH
computer architecture news, 1995.
Yeh, Tse-Yu and Patt, Yale N. Two-level Adaptive Training
Branch Prediction. In ACM ISM, 1991.
