GPU activity prediction is an important and complex problem because of the high level of contention among thousands of parallel threads. It has mostly been addressed using heuristics. We propose a representation learning approach instead. We model any performance metric as a temporal function of the executed instructions, with the intuition that the flow of instructions can be identified as distinct activities of the code. Our experiments show high accuracy and non-trivial predictive power of representation learning on a benchmark.
Introduction
The performance of a computing system relies on sustained operational throughput. Sustained operation is becoming harder to achieve as computation workloads become more complex. At the same time, with the end of Dennard scaling (Esmaeilzadeh et al., 2011) and the increasing abundance of Big Data, it is imperative to minimize wasted processor effort in order to achieve processor reliability and scalability (Wulf & McKee, 1995; Patterson, 2006; Bergman et al., 2008).
The goal of this paper is to demonstrate the efficacy of machine learning for designing computing systems. We anticipate that classical problems such as branch prediction and cache management can be re-evaluated so that heuristic approaches (Yeh & Patt, 1991; Fung et al., 2007; Li et al., 2015) can be replaced with a machine learning approach (Leng et al., 2015). It is well understood in the computer architecture community that processor behavior is highly complex and data dependent. Processor data is widely available in the form of benchmarks (Che et al., 2010), and algorithms are extensively compared using these benchmarks (Blem et al., 2011).
We choose the well-understood and well-defined problem of predicting GPU cache misses (Li et al., 2015; Chen et al., 2014) for this paper. General-purpose GPUs (GPGPUs) achieve high-throughput execution via a high level of parallelism. Predicting GPU cache misses is complex due to the high level of contention among thousands of threads. Cache contention becomes a bottleneck for parallel execution when many threads are waiting on cache operations, to the point where adding more threads (or cores) is detrimental. Predicting whether a cache miss is about to occur is useful for better cache management, such as cache bypassing (Chen et al., 2014), pre-fetching (Lee et al., 2010), and prioritized allocation (Li et al., 2015). Further, cache misses indirectly cause increased energy and power usage (Leng et al., 2015) because of second-order effects beyond memory latency. In principle, our approach can also predict these higher-order events (such as voltage scaling (Leng et al., 2015) and faults (S. Chai, 2014)), either directly (Tiwari et al., 1994) or via hierarchical modeling.
We propose a new model that can predict key processor events that limit processor throughput. Specifically, we propose a new variant of Conditional Restricted Boltzmann Machines (CRBMs) (Taylor et al., 2011) to directly address system performance and reliability. CRBMs efficiently model short-term temporal phenomena. Prior work used a perceptron to predict cache misses (Leng et al., 2013). Unlike that approach, our model accounts for time-series and count data.
Our approach assumes the availability of a simulator for CUDA (Bakhoda et al., 2009) that can generate a dataset for training our model. In principle, this approach can be used in real time by incrementally augmenting the dataset. Repeated executions can even increase predictive power, because more data becomes available for learning. Our predictor is naturally agnostic to the hardware and architecture, as it relies only on execution traces.
Our contributions:
• Prediction of processor events as temporally-extended activities in a stream of instructions.
• (Ball & Larus, 1993). Dynamic solutions (Yeh & Patt, 1991) are more complicated, as they are implemented with counters and tables that store branch history based on the branch memory address. Other, data-driven approaches use perceptrons (Jiménez & Lin, 2001) and feed-forward neural networks (Calder et al., 1997).
Representation Learning: Restricted Boltzmann Machines (RBMs) form the building blocks of energy-based deep networks (Salakhutdinov & Hinton, 2006). Recently, temporal models based on deep networks have been proposed that are capable of modeling a temporally richer set of problems. These include Conditional RBMs (CRBMs) (Taylor et al., 2011) and Temporal RBMs (TRBMs) (Sutskever & Hinton, 2007). CRBMs have been used in both the visual (Taylor et al., 2011) and audio (Mohamed & Hinton, 2009) domains. In addition to efficiently modeling time-series data, RBMs have been formulated to be trained discriminatively for classification (Larochelle & Bengio, 2008) and to model word-count vectors from a large set of documents (Salakhutdinov & Hinton, 2009).
Model
The input to our model (the visible units) is an instruction mix per time step, i.e., the histogram of counts of instructions being executed, obtained from the GPU simulator. The labels are any chosen performance metric also output by the simulator.
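To make the input representation concrete, the following minimal sketch builds a per-cycle instruction-mix histogram from a toy trace. The opcode names, the trace format (one list of issued opcodes per cycle), and the function name `instruction_mix` are assumptions made for illustration; they are not the simulator's actual output format.

```python
from collections import Counter

def instruction_mix(trace, opcodes):
    """Return one count vector per cycle, ordered like `opcodes`.

    trace   : list of cycles, each a list of opcode strings issued in that cycle
    opcodes : fixed opcode vocabulary (hypothetical names in this sketch)
    """
    mix = []
    for issued in trace:
        counts = Counter(issued)
        mix.append([counts.get(op, 0) for op in opcodes])
    return mix

# Hypothetical opcode vocabulary and a three-cycle toy trace.
opcodes = ["ld", "st", "add", "mul", "bra"]
trace = [["ld", "ld", "add"], ["mul", "add", "add", "st"], []]

visible = instruction_mix(trace, opcodes)
for t, counts in enumerate(visible):
    print(t, counts)
```

Each row of `visible` plays the role of the visible vector at one time step; an idle cycle simply yields an all-zero histogram.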
We discuss a sequence of models, gradually increasing in complexity, so that the different components of our model can be understood in isolation. We start with the basic CRBM model, then extend it to the discriminative DCRBM, and finally to the Count-DCRBM.
Conditional Restricted Boltzmann Machines: CRBMs (Taylor et al., 2011) are a natural extension of RBMs for modeling short-term temporal dependencies. A CRBM is an RBM that, at time $t$, takes into account the history from the previous time instances $t-N, \ldots, t-1$. This is done by treating the previous time instances as additional inputs, which does not complicate inference. Here $v$ is a vector of visible nodes, $h$ is a vector of hidden nodes, and $v_{<t}$ denotes the visible vectors from the previous $N$ time instances, which influence the current visible and hidden vectors. $E_C$ is the energy function and $Z$ is the partition function. The parameters $\theta$ to be learned are the biases $a$ and $b$ for $v$ and $h$, respectively, and the weights $W$. $A$ and $B$ are matrices of concatenated vectors of previous time instances of $a$ and $b$. The CRBM is fully connected between layers, with no lateral connections. This architecture implies that $v$ and $h$ are each factorial given the other, which allows for the exact computation of $p_C(v \mid h, v_{<t})$ and $p_R(h \mid v, v_{<t})$. Some approximations have been made to facilitate efficient training and inference; more details are available in Taylor et al. (2011). A CRBM defines a probability distribution $p_C$ as a Gibbs distribution (1).
The energy function $E_C(v_t, h_t \mid v_{<t})$ in (2) is defined in a manner similar to that of the RBM.
The probability distributions for the visible nodes are defined in (3), where $\mathcal{N}$ is a normal distribution, $\sigma$ is the logistic sigmoid, and $\mathcal{P}$ is a Poisson distribution. The hidden nodes are defined in (4).
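As an illustration of how the history enters a CRBM, the sketch below computes the hidden activation probabilities using the standard formulation of Taylor et al. (2011), in which the past visible vectors shift the hidden bias (a "dynamic bias"). The function names, the concatenated-history layout, and the toy parameter values are assumptions of this sketch and may differ in detail from the notation used in this paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_bias(bias, hist_weights, history):
    """bias[j] + sum_k hist_weights[k][j] * history[k].

    history is the concatenation of the previous N visible vectors.
    """
    return [
        b_j + sum(hist_weights[k][j] * history[k] for k in range(len(history)))
        for j, b_j in enumerate(bias)
    ]

def crbm_hidden_probs(v_t, history, W, b, B):
    """p(h_j = 1 | v_t, v_<t) for every hidden unit j."""
    b_hat = dynamic_bias(b, B, history)
    return [
        sigmoid(b_hat[j] + sum(W[i][j] * v_t[i] for i in range(len(v_t))))
        for j in range(len(b))
    ]

# Toy example: 2 visible units, 1 hidden unit, history of one past frame.
v_t = [1.0, 0.0]
history = [1.0, 0.0]
W = [[0.5], [-0.5]]   # visible-to-hidden weights
b = [0.1]             # static hidden bias
B = [[0.4], [0.0]]    # history-to-hidden weights

probs = crbm_hidden_probs(v_t, history, W, b, B)
print(probs)
```

Because the history only shifts the biases, the hidden units remain conditionally independent given the current and past visible vectors, which is why conditioning on the past does not complicate inference.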
Discriminative CRBMs: DCRBMs are based on the model in (Larochelle & Bengio, 2008), generalized to account for temporal phenomena using CRBMs. DCRBMs are a simpler version of the Factored Conditional Restricted Boltzmann Machines (Taylor et al., 2011) and Gated Restricted Boltzmann Machines (Memisevic & Hinton, 2007). Both of these models incorporate labels in learning representations; however, they use a more complicated potential that involves three-way connections into factors. DCRBMs define the probability distribution $p_{DC}$ as a Gibbs distribution (6).
The hidden layer $h$ is defined as a function of the labels $y$ and the visible nodes $v$. A new probability distribution for the classifier is defined to relate the label $y$ to the hidden nodes $h$ as in (7), and to relate $h$ to $y$ as in (8). The new energy function $E_{DC}$ is shown in (9).
Count-DCRBMs: We extend the DCRBM to the Count-DCRBM (Figure 1). Count-DCRBMs are based on the model in (Salakhutdinov & Hinton, 2009), generalized to account for temporal phenomena using CRBMs and for discriminative classification. Count-DCRBMs are used to model time-varying histograms of counts. The probability distribution over the visible layer follows a constrained Poisson distribution, $p_{\text{C-Count}}(v_{i,t} \mid h_t, v_{<t})$, defined in (3); the hidden layer follows (7); the label layer follows (8); and the energy function $E_{\text{C-Count}}(v_t, h_t \mid v_{<t})$ is defined in (9).
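For intuition about the constrained Poisson visible layer, the sketch below computes the Poisson rates in the form used by Salakhutdinov & Hinton (2009): a softmax over visible units, scaled by the total count observed at that time step, so that the rates always sum to the total count. The function name and toy values are hypothetical, and in the temporal model the visible bias would additionally depend on the history, which is omitted here.

```python
import math

def constrained_poisson_rates(h, W, lam, total_count):
    """Poisson rate for each visible unit i:

        total_count * exp(lam[i] + sum_j W[i][j] h[j]) / sum_k exp(...)
    """
    logits = [
        lam[i] + sum(W[i][j] * h[j] for j in range(len(h)))
        for i in range(len(lam))
    ]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [total_count * e / z for e in exps]

# Toy example: 3 instruction types, 2 hidden units, 8 instructions in the cycle.
h = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
lam = [0.0, 0.0, 0.0]

rates = constrained_poisson_rates(h, W, lam, total_count=8)
print(rates, sum(rates))
```

The normalization ties the expected histogram to the observed number of instructions per step, so the hidden units only need to explain the relative instruction mix rather than its overall magnitude.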
Inference and Learning
Inference: To perform classification at time $t$ in the Count-DCRBM given $v_{<t}$ and $v_t$, we use a bottom-up approach: we compute a cost for each possible label $y_t$ and then choose the label with the least cost. The cost for label $y_t$ is the free energy $-\log p_{DC}(y_t, v_t \mid v_{<t})$, computed by marginalizing over $h_{<t}$ and $h_t$. This free energy is tractable because the sum over exponentially many terms can be algebraically eliminated.
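The label-selection rule can be sketched as follows, using the free energy of a discriminative RBM in the form given by Larochelle & Bengio (2008). Terms that depend only on the visible vector are dropped because they are identical for every candidate label. The parameter names (`U` for label-to-hidden weights, `d` for label biases, `c_hat` for a hidden bias that is assumed to already include the history contribution) are assumptions of this sketch.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def free_energy(label, v_t, W, U, c_hat, d):
    """Label-dependent part of the free energy:

        -d[y] - sum_j softplus(c_hat[j] + U[j][y] + sum_i W[i][j] v_t[i])

    The softplus is the closed form of the sum over each binary hidden unit.
    """
    total = -d[label]
    for j in range(len(c_hat)):
        act = c_hat[j] + U[j][label] + sum(W[i][j] * v_t[i] for i in range(len(v_t)))
        total -= softplus(act)
    return total

def classify(v_t, W, U, c_hat, d):
    """Return (best label, per-label costs); the best label has least free energy."""
    costs = [free_energy(y, v_t, W, U, c_hat, d) for y in range(len(d))]
    best = min(range(len(d)), key=costs.__getitem__)
    return best, costs

# Toy example: 1 visible unit, 1 hidden unit, 2 labels (no miss / miss).
v_t = [1.0]
W = [[1.0]]
U = [[-2.0, 2.0]]
c_hat = [0.0]
d = [0.0, 0.0]

label, costs = classify(v_t, W, U, c_hat, d)
print(label, costs)
```

Each hidden unit contributes one softplus term, so evaluating all labels costs time linear in the number of hidden units rather than exponential.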
Learning: The parameters of our model can be learned using Contrastive Divergence (CD) (Hinton, 2002), where $\langle \cdot \rangle_{\text{data}}$ is the expectation with respect to the data distribution and $\langle \cdot \rangle_{\text{recon}}$ is the expectation with respect to the reconstructed data. Learning proceeds in two steps, a bottom-up pass and a top-down pass, using the sampling equations from (3), (7), and (8). Bottom-up: the hidden layer is first sampled from $p(h_{t,j} = 1 \mid v_t, v_{<t}, y_l)$ for all hidden nodes in parallel. Top-down: the reconstruction is then generated by sampling $p(v_{i,t} \mid h_t, v_{<t})$ and $p(y_{l,t} \mid h_t, h_{<t})$ for all visible and label nodes in parallel.
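The two-pass procedure can be illustrated with a single CD-1 step for a plain binary RBM. This is a deliberately simplified sketch: it omits the history terms, the label layer, and the Poisson visible units of the Count-DCRBM, and all names and hyperparameters are hypothetical. It only shows the structure of the update, i.e. data statistics minus reconstruction statistics.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b):
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update in place; returns the squared reconstruction error."""
    # Bottom-up pass: sample the hidden layer from the data.
    ph0 = hidden_probs(v0, W, b)
    h0 = sample(ph0, rng)
    # Top-down pass: reconstruct the visible layer, then re-infer the hiddens.
    v1 = sample(visible_probs(h0, W, a), rng)
    ph1 = hidden_probs(v1, W, b)
    # Update: <v h>_data - <v h>_recon, and likewise for the biases.
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return sum((x - y) ** 2 for x, y in zip(v0, v1))

rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(3)]
a = [0.0] * 3
b = [0.0] * 2
for epoch in range(100):
    err = cd1_step([1.0, 0.0, 1.0], W, a, b, lr=0.1, rng=rng)
print(err, a)
```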
Experiments
We used the open-source simulator GPGPU-Sim (Bakhoda et al., 2009) to generate data to validate our approach. The simulator has been rigorously verified for accuracy on a suite of 80 microbenchmarks (Leng et al., 2013). We used the Backprop problem from the Rodinia benchmark suite (Che et al., 2010) and simulated an NVIDIA GTX480 GPU with the default GPGPU-Sim configuration. This benchmark CUDA program trains a feedforward neural network with one hidden layer consisting of 4096 units.
To generate our dataset, we modified GPGPU-Sim to retrieve the time-indexed list of instruction mixes, i.e., for each time cycle, the number of instructions of each type based on opcode. These are the visible units in our model. We tested our approach on three different caches (Instruction (IC), Data Read ($D_R$), Data Write ($D_W$)) localized within one core of the GPU. For each cache, GPGPU-Sim outputs a list of time-indexed binary labels indicating whether a cache miss occurred. Since we want to predict a cache miss ahead of time, we aggregated the labels over 128 cycles, so that a label of $y(t) = 1$ means that a cache miss occurred in cycles $[t, t + 128]$.
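The label aggregation can be sketched as a forward-looking window over the per-cycle miss indicators. The function name is hypothetical, and truncating the window at the end of the trace is an assumption of this sketch, since the handling of the final cycles is not specified here.

```python
def aggregate_labels(miss, horizon=128):
    """y[t] = 1 if any cache miss occurs in cycles [t, t + horizon], else 0.

    miss : list of 0/1 per-cycle miss indicators
    The window is truncated at the end of the trace (an assumption).
    """
    n = len(miss)
    # Prefix sums make every window query O(1).
    prefix = [0]
    for m in miss:
        prefix.append(prefix[-1] + m)
    labels = []
    for t in range(n):
        end = min(t + horizon, n - 1)          # inclusive upper bound
        labels.append(1 if prefix[end + 1] - prefix[t] > 0 else 0)
    return labels

# Toy trace with a single miss at cycle 3 and a short horizon for readability.
miss = [0, 0, 0, 1, 0, 0]
labels = aggregate_labels(miss, horizon=2)
print(labels)
```

With a horizon of 2, the miss at cycle 3 turns on the labels for cycles 1 through 3, which is what lets the model be trained to anticipate a miss before it happens.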
The Count-DCRBM was trained on a Tesla K20C GPU using Contrastive Divergence with a constant learning rate of $10^{-5}$. Table 1 shows the final accuracies of a model with 15 hidden nodes and varying temporal history available for DCRBM. The second and third columns are metrics that describe predictive power, taking into account false positives and false negatives. We observe high accuracy and predictive power of the model for all three caches. We also observe that increased history generally leads to better performance despite the increased model complexity. Our baseline is an SVM that uses the raw instruction mix as features, without any temporal history.
Model accuracy can be misleading because cache miss events are rare (e.g., about 10% for IC). Figure 2 (Top) shows these metrics over training epochs for the data write cache. Note that the initial model accuracy is already about 70%, at which point the model predicts that a cache miss never occurs, with corresponding F1 and Matthews Correlation Coefficient (MCC) values of zero. As training epochs increase, we note a sharp increase in predictive power around 5000 epochs. Figure 2 (Middle) shows the reconstruction error, the objective value for training, over epochs for the data write cache. We observe that the reconstruction error drops significantly in the first 20k epochs. Figure 2 (Bottom) shows a measure of the classification error, measured in terms of the binary cross entropy between the true and predicted labels. We also observe that the classification error continues to drop steadily even though the reconstruction error has converged, showing that the model accounts for label information. Figure 3 shows prediction using a history of 10 cycles, in comparison with the ground truth. Future work includes validating our approach across microbenchmarks.
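The gap between accuracy and predictive power can be reproduced with a small sketch. The labels below are synthetic and chosen only so that 30% of the cycles are misses; they are not data from the experiments. Returning zero when a metric's denominator vanishes is a common convention and is assumed here.

```python
import math

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0   # convention: 0 when undefined

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # convention: 0 when undefined

# Synthetic labels: 3 misses in 10 cycles.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
never_miss = [0] * 10                       # a model that never predicts a miss
some_skill = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # a model with one FN and one FP

for name, y_pred in [("never_miss", never_miss), ("some_skill", some_skill)]:
    c = confusion_counts(y_true, y_pred)
    print(name, accuracy(*c), f1_score(*c), mcc(*c))
```

The always-negative predictor reaches 70% accuracy on these labels while scoring zero on both F1 and MCC, which is why the latter two are reported alongside accuracy.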
Conclusions
Our approach has significant implications for the GPU revolution in computing. A data-driven approach can potentially identify the mixes of instructions that cause performance bottlenecks. Although we focused on cache misses, any statistic of interest to the computer architecture community, such as power consumption and voltage, can potentially be predicted. The extension to an online embedded setting is straightforward and could potentially save computation time. Prediction of performance bottlenecks is a step towards a cognitive processor architecture.
