Abstract-The growing security threat of microarchitectural attacks underlines the importance of robust security sensors and detection mechanisms at the hardware level. While there are studies on runtime detection of cache attacks, a generic model that covers the broad range of existing and future attacks is missing. Previous approaches consider either a single attack variant, e.g. Prime+Probe, or specific victim applications such as cryptographic implementations. Furthermore, the state-of-the-art anomaly detection methods are based on coarse-grained statistical models, which are not successful at detecting anomalies in large-scale real-world systems.
I. INTRODUCTION
In the past decade, we have witnessed the evolution of microarchitectural side-channel attacks [15] , [18] , [29] , [30] , [64] , from being considered as a nuisance and largely dismissed by chip manufacturers to becoming frontpage news. The severity of the threat was demonstrated by the Spectre [31] and Meltdown [33] attacks, which allow a user with minimum access right to easily read arbitrary locations in the memory by exploiting the transient effect of illegal instruction sequences. This was followed by a plethora of attacks [35] , [39] , [47] , [56] either extending the scope of the microarchitectural flaws or identifying new leakage sources. It is noteworthy that these critical vulnerabilities managed to stay hidden for decades. Only after years of experimentation, researchers managed to gain sufficient insight into, for the most part, the unpublished aspects of these platforms. This leads to the point that they could formulate fairly simple but very subtle attacks to recover internal secrets. Therefore, the natural question becomes: how can we discover dormant vulnerabilities and protect against such subtle attacks? A fundamental approach is to eliminate the leakage altogether by using formal analysis. However, given the tremendous level of complexity of modern computing platforms and lack of public documentation, formal analysis of the hardware seems impractical in the near future. What remains is the modus operandi: leaks are patched as they are discovered by researchers through inspection and statistical analysis.
Countermeasures for microarchitectural side-channel attacks focus on the operating system (OS) hardening [17] , [34] , software synthesis [8] , [44] and analysis [4] , [60] , [61] , and static [28] or dynamic [7] , [10] , [67] detection of attacks. Static analysis is performed by evaluating the untrusted software against known malicious code patterns without running it on a target platform [28] . Alternatively, dynamic analysis aims to detect malicious behaviors in the system by analyzing the runtime footprint of the running processes [10] . Existing works on dynamic detection of microarchitectural attacks are based on collecting footprints from the hardware performance counters (HPCs) and limited modeling of malicious behaviors [7] , [10] , [41] , [67] . A crucial challenge for both detection techniques is the shortage of knowledge about new attack vectors. Therefore, modeling malicious behaviors for undiscovered attacks and accurately distinguishing them from benign activities are open problems. Moreover, microarchitectural attacks are in infancy, and supervised learning models, which are used as attack classifier [41] , are not reliable to detect known attacks due to the insufficient amount and imprecise labeling of the data. Hence, unsupervised methods are more promising to adapt the detection models to real world scenarios.
Anomaly-based attack detection, which has also been studied in other security applications [16], [49], aims to address the aforementioned challenge by only modeling the benign behaviors and detecting outliers. While there have been several efforts on anomaly-based detection of cache attacks [7], [10], modern microarchitectures have a diverse set of components that suffer from side-channel attacks [15], [18], [40], [65]. Thus, detection techniques will not be practical and usable if they do not cover a broad range of both known and unseen attacks. This requires more advanced learning algorithms to comprehensively model the entire behavior of the microarchitecture. On the other hand, statistical methods for anomaly detection are not sufficient to analyze millions of events that are collected from a very complex system like the modern microarchitecture. A major limitation of the classical statistical learning methods is that they use a hand-picked set of features, which wastes valuable information that characterizes the benign execution patterns. As a result, these techniques fail at building a generic model for real-world systems.
arXiv:1907.03651v1 [cs.CR] 8 Jul 2019
The latest advancements in Deep Learning, especially in Recurrent Neural Networks (RNNs), show that time-dependent tasks such as language modeling [53] and speech recognition [46] can be learned, and upcoming sequences predicted more efficiently, by training on millions of data samples. Similarly, computer programs are translated to processor instructions, and the corresponding microarchitectural events have time-dependent behaviors. Modeling the sequential flow of these events for benign applications is extremely difficult by using logic and formal reasoning due to the complexity of the modern microarchitecture. We claim that these time-dependent behaviors can be modeled at a large scale by observing a sufficient number of benign execution flows. Since the long-term dependencies in the time domain can be learned with a high accuracy by training Long-Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, the fingerprint of benign applications in a processor can also be learned efficiently. In addition, the challenging task of choosing the features of benign applications can be done automatically by LSTM/GRU networks in the training phase without any expert input.
Our Contribution: We propose FortuneTeller, which is the first generic detection model/technique for microarchitectural attacks. FortuneTeller learns the benign behavior of hardware/software systems by observing microarchitectural events, and classifies any outlier that does not conform to the trained model as malicious behavior. FortuneTeller can detect unseen microarchitectural attacks, since it only requires training over benign execution patterns.
In summary, we propose FortuneTeller which:
• is a generic detection technique, that can be applied to detect attacks on other microarchitectures and execution environments.
• for the first time, can detect various attacks automatically, regardless of the victim application, including cryptographic implementations, browser passwords, secret data in kernel environment, bit flips and so on.
• can detect attacks that were not observed during the training, or future attacks that may be introduced by the security community.
More specifically, we show:
1) different types of hardware performance counters can be used as the most suitable security sensors available on commodity processors.
2) how to capture the system-wide low-level microarchitectural traces and learn noisy time-dependent sequences through advanced RNN algorithms by training a more advanced and generic model.
3) FortuneTeller performs better than the state-of-the-art detection techniques.
4) we can detect malicious behavior dynamically in an unsupervised manner, including stealthy cache attacks (Flush+Flush), transient execution attacks (Meltdown, Spectre, Zombieload) and Rowhammer.
A. Outline:
The rest of the paper is organized as follows: Section II provides background information on microarchitectural attacks, performance counters and RNNs. Then, Section III gives an overview of previous works. Section IV outlines the methodology and implementation of FortuneTeller, and also describes our benign and attack datasets and the performance counter selection. Section V evaluates the results. The comparison with the prior works is given in Section VI. Finally, Section VII discusses the results and Section VIII concludes our work.
II. BACKGROUND

A. Microarchitectural Attacks
Modern computer architecture has a tremendously complex and optimized design. In order to improve the performance, several low-level features have been introduced such as speculative branches, out-of-order executions and shared last level cache (LLC). All these components are potential targets for microarchitectural attacks. Therefore, the following paragraphs give insight into microarchitectural attacks, which are examples of attacks that can be detected by FortuneTeller.
Flush+Reload (F+R) The LLC is shared among all cores in the processor. The Flush+Reload attack [64] aims to track accesses to specific cache lines by using the clflush instruction. First, the adversary flushes the victim cache line. Then, the victim executes some instructions. Finally, the adversary reloads the same cache line and measures the access time. The Flush+Reload attack is mostly used to recover cryptographic keys [63], and it is applicable on systems with memory deduplication enabled, such as cloud environments [21], [27].
Flush+Flush (F+F) The Flush+Flush attack uses the clflush instruction to flush specific cache lines [20]. Instead of measuring the time to access a cache line, the execution time of the clflush instruction is measured. This method is considered a stealthy attack against detection methods, since the number of cache misses introduced by this attack is low. The Flush+Flush attack is used to exploit the T-table implementation of AES and the user's keystrokes [20].
Prime+Probe (P+P) In a Prime+Probe attack, an adversary aims to fill an entire cache set, and then measures the access time to the same cache set [55]. If a victim evicts any of the adversary's cache lines from the set, the adversary will observe an increased access latency, which leaks information about the victim's memory access pattern. While it has a lower resolution compared to Flush+Reload and Flush+Flush, it has a broader applicability. The Prime+Probe attack was applied in the cloud environment to steal secret keys [24], [26], [68], in JavaScript to detect the visited webpages [42] and on mobile phones to detect applications and user input [22], [32].
Rowhammer DRAM cells have the possibility to leak charge over time. Rowhammer [19] triggers the leak by accessing neighboring rows repeatedly. This leads to bit flips, which enable adversaries with low access rights to gain system privileges [48]. The clflush instruction is also commonly used to increase repeated access to the DRAM by bypassing the cache [5].
Spectre Spectre attacks exploit speculative branches [31] . This attack is able to read memory addresses, which do not belong to the adversary by misusing the branch prediction. Therefore, sensitive data such as credentials stored in the browser can be leaked from the victim's memory space. Spectre is also effective against the SGX environment to compromise the trusted execution [9] , [56] .
Meltdown Meltdown attack focuses on out-of-order execution to read kernel memory addresses [33] . The victim's secret, which is loaded into the registers will be mapped to different cache lines. Flush+Reload is used to determine if a specific cache line has been accessed. An adversary with only user privileges can perform this attack to read the content of kernel address space. The same concept has also been applied to Intel SGX [56] to bypass the hardware supported memory isolation.
ZombieLoad Meltdown-style attacks can also specifically leak data from various microarchitectural resources such as the store buffer [39], line fill buffer (LFB) [47] and load ports [57]. The ZombieLoad attack leaks data from memory load operations that are executed in other user processes and kernel context. Faulting/assisting loads that are executed by a malicious process can retrieve the stale data belonging to other security domains. This data, which has been falsely forwarded from the shared resources, may include secrets such as cryptographic keys or website URLs, which can be transmitted over a covert channel such as the Flush+Reload technique.
B. Hardware Performance Counters
Hardware Performance Counters (HPCs) store low-level hardware-related events in the CPU. These events are tracked as counters that are available through special purpose registers. There are various performance events available in processors. The counters are used to collect information about the system behavior while an application is running. They have been used by researchers to reverse-engineer the internal design choices in the processor [37], or to increase the performance of the software by analyzing the bottlenecks [3]. These low-level counters are provided on all major architectures developed by ARM, Intel, AMD, and NVIDIA.
There are various tools to program and read performance counters. Intel PCM [25] supports both core and offcore counters on Intel processors. The core counters give access to events within a single core of a processor, while the offcore counters profile events and activities across the cores and within the processor's die. This includes some of the events related to the integrated memory controller and the Intel QuickPath Interconnect which is shared by all cores. Before using the performance counters, we need to program each of them to monitor a specific event. Afterwards, the counter state can be sampled. In this work, we only focused on core counters, since the offcore counters have a small variety.
C. Recurrent Neural Networks (RNNs)
RNNs are a type of Artificial Neural Network algorithm, which is used to learn and predict sequential data. RNNs are mostly applied to speech recognition and are currently used by Apple's Siri [52] and Google's Voice Search [11]. The reason behind the integration of RNNs into real world applications is that it is the first algorithm to remember the temporal relations in the input through its internal memory. Therefore, RNNs are mostly preferred for tasks where sequential data is involved.
In a typical RNN structure, the information cycles through a loop. When the algorithm needs to make a decision, it uses the current input $x_t$ and the hidden state $h_{t-1}$, where the learned features from the previous data samples are kept, as shown in Figure 1a. Basically, an RNN algorithm produces output based on the previous data samples, and provides the output as a feedback into the network. However, traditional RNN algorithms are not good at learning long-term sequences because the amount of extracted information converges to zero with increasing time steps. In other words, the gradient vanishes and the model stops learning after long sequences. In order to overcome this problem, two algorithms were introduced, as described below:
1) Long-Short Term Memory: Long-Short Term Memory (LSTM) networks are modified RNNs, which essentially extends the internal memory to learn longer time sequences.
LSTM networks consist of a memory cell and input, forget and output gates, as shown in Figure 1b. The memory cell keeps the learned information from the previous sequences. While the cell state is modified by the forget gate, the output of the forget gate multiplies the specific positions in the input matrix by 0 to forget and by 1 to keep the information. The forget gate equation is as follows: $f_t = \sigma(W_f x_t + U_f h_{t-1})$, where the sigmoid function is applied to the weighted input and the previous hidden state. In the input gate, the useful input sections are determined to be fed into the cell state. The input gate equation is $i_t = \sigma(W_i x_t + U_i h_{t-1})$, where the sigmoid function is used as an activation function. This gate is combined with the input modulation gate to switch the cell state to forget memory. The activation function for the input modulation gate is tanh. Finally, the output gate $o_t = \sigma(W_o x_t + U_o h_{t-1})$ passes the output to the next hidden state by applying the equation $h_t = o_t \odot \tanh(c_t)$, where $c_t$ is the cell state and tanh is used as an activation function. Therefore, LSTM networks can select distinct features in the time sequence data more efficiently than RNNs, which enables learning the long-term temporal relations in the input.
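As an illustration of how the gates interact, the following is a minimal sketch of a single LSTM step in the commonly used bias-free formulation: sigmoid forget, input and output gates, a tanh input modulation gate, cell update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, and hidden state $h_t = o_t \odot \tanh(c_t)$. The parameter names (`Wf`, `Uf`, ...) and the helper functions are hypothetical and chosen for this example only; they are not taken from the FortuneTeller implementation, which relies on Keras.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h):
    """Compute W x + U h for list-of-rows matrices and list vectors."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hk for u, hk in zip(u_row, h))
        for w_row, u_row in zip(W, U)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps hypothetical names (Wf, Uf, ...) to weights."""
    f = [sigmoid(z) for z in affine(p["Wf"], x_t, p["Uf"], h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(p["Wi"], x_t, p["Ui"], h_prev)]    # input gate
    g = [math.tanh(z) for z in affine(p["Wg"], x_t, p["Ug"], h_prev)]  # input modulation gate
    o = [sigmoid(z) for z in affine(p["Wo"], x_t, p["Uo"], h_prev)]    # output gate
    # Forget part of the old cell state, then add the gated new candidate.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Expose a gated, squashed view of the cell state as the hidden state.
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t
```

With all weights set to zero, every sigmoid gate evaluates to 0.5, so half of the previous cell state is kept and nothing new is written, which makes the role of the forget gate easy to check by hand.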
2) Gated Recurrent Unit: Gated Recurrent Unit (GRU) is an improved version of RNNs. GRU uses two gates, called the update gate and the reset gate. The update gate uses the following equation: $z_t = \sigma(W_z x_t + U_z h_{t-1})$. Basically, both the current input and the previous hidden state are multiplied with their own weights and added together. Then, a sigmoid activation function is applied to map the data between 0 and 1. The importance of the update gate is to determine the amount of the past information to be passed along to the future. Then, the reset gate is used to decide how much of the past information to forget. In order to calculate how much to forget, the equation $r_t = \sigma(W_r x_t + U_r h_{t-1})$ is used, where the previous hidden state and current input are multiplied with their corresponding weights. Then, the results are summed and the sigmoid function is applied. The output is passed to the current memory cell, which stores the relevant information from the past. It is calculated as $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$. The element-wise product between the reset gate and the weighted previous hidden layer state determines the information to be removed from previous time steps. Finally, the current information is calculated by the equation $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$. The purpose of this part is to use the information obtained from the update gate and combine both reset and update gate information. Hence, while the relevant samples are learned by the update gate, the redundant information such as noise is eliminated by the reset gate.
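The GRU update described above can be sketched in a few lines. The example below follows the four equations of this subsection (update gate, reset gate, current memory content, final hidden state) without bias terms. The dictionary keys (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`) and the helper names are assumptions made for this illustration and do not come from the original implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Multiply a list-of-rows matrix by a list vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def gru_step(x_t, h_prev, p):
    """One GRU step; p maps hypothetical names to weight matrices."""
    # Update gate: how much of the past state is carried forward.
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    # Reset gate: how much of the past state is forgotten.
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    # Current memory content: tanh(W x_t + r_t * (U h_{t-1})).
    uh = matvec(p["U"], h_prev)
    cand = [math.tanh(a + rk * b)
            for a, rk, b in zip(matvec(p["W"], x_t), r, uh)]
    # Final state: z_t * h_{t-1} + (1 - z_t) * candidate.
    return [zk * hk + (1.0 - zk) * ck for zk, hk, ck in zip(z, h_prev, cand)]
```

Compared with the LSTM, the GRU keeps no separate cell state, so a step needs only the previous hidden state and three weight pairs.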
In this work, the RNN algorithms are used in an unsupervised fashion where there is no need for separate validation dataset in the training phase. The validation error is calculated for each prediction in the next timestamp and the total validation error is given after each epoch.
III. RELATED WORK
A. Detecting Attacks using HPCs
Low-level performance monitoring events such as HPCs have been used as security sensors to detect malicious activities [23] , [36] . Similar to [62] , [66] , Numchecker [58] and Confirm [59] adopt these sensors to detect control flow violations, which are applied to rootkits and firmware modifications, respectively. In addition, classical ML algorithms such as support vector machines (SVMs) and k-nearest neighbors (KNNs) are adapted to naive heuristic-based techniques for multi-class classification [6] , [13] . The latter explores neural network in a supervised fashion [13] . Tang et al. [54] train One-Class Support Vector Machine (OC-SVM) with benign system behavior and detect the malware in the system.
Besides the detection of malware and rootkits in the system, HPCs have also been used to detect microarchitectural attacks. Since our work focuses on microarchitectural attack detection, the features of prior approaches and our detection technique are compared in Table I. Firstly, Chiappetta et al. [10] propose to monitor HPCs, and the data is analyzed by using Gaussian Sampling (GS) or a probability density function (pdf) to detect the anomalies on cryptographic implementations dynamically. Later, Zhang et al. [67] apply Dynamic Time Warping (DTW) to catch cryptographic implementation executions in the victim VMs. Then, the number of cache misses and hits in the attacker VMs are monitored during the execution of the sensitive operations. Briongos et al. [7] implement a Change Point Detection (CPD) technique to determine the sudden changes in the time series data to detect F+F, F+R and P+P attacks. Finally, Mushtaq et al. [41] detect the cache-oriented microarchitectural attacks with supervised Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) and Linear Regression (LR) techniques under various system loads. We further compare the most related works with FortuneTeller in Section VI.
B. RNN Applications in Security
RNN algorithms are applied to other security domains to increase the efficiency of defensive technologies. For instance, Shin et al. [51] leverage RNNs to identify functions in the program binary. Once the model is trained with these functions, the technique classifies each byte to decide whether it is the beginning of a function or not. Similarly, Pascanu et al. [43] apply RNNs to detect malware by training on the APIs in an unsupervised way. The technique improves the true positive rate by 98% compared to previous studies. In another study, Melicher et al. [38] introduce an RNN-based technique to model the resistance of passwords to guessing attacks. This study shows better accuracy than Markov models. Furthermore, Du et al. [14] implement LSTM-based anomaly detection to detect anomalies in the system. The LSTM model is trained with log data obtained from normal execution. Their results show that the traditional data mining techniques underperform the LSTM model in detecting the anomalies. Finally, Shen et al. [50] apply LSTM and GRU networks to predict the next security events with a precision of up to 0.93. These studies indicate that RNN-based security applications are commonly used in other challenging environments.

In the offline phase, FortuneTeller collects time sequence data from a diverse set of benign applications by monitoring security sensors in the system. The collected data is used as the training data and it is fed into the RNN algorithm with a sliding window technique. The weights of the trained model are optimized by the algorithm itself, since each data sample is also used for validation. When there is no further improvement in the validation error, the training process stops. Once the RNN model is trained, it is ready to be used in a real time system.
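The stopping rule of the offline phase ("when there is no further improvement in the validation error, the training process stops") can be illustrated with a small helper. This is only a sketch under assumptions: the function name, the `min_delta` tolerance and the `patience` parameter are hypothetical and are not described in the paper, which does not specify how "no further improvement" is measured.

```python
def stopping_epoch(val_errors, min_delta=0.0, patience=1):
    """Return the 1-based epoch at which training would stop.

    val_errors: validation error observed after each epoch.
    min_delta:  smallest decrease that still counts as an improvement (assumed).
    patience:   number of non-improving epochs tolerated before stopping (assumed).
    """
    best = float("inf")
    stale = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best - min_delta:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    # Never stalled: training runs for all recorded epochs.
    return len(val_errors)
```

For example, with validation errors `[0.5, 0.3, 0.2, 0.2, 0.1]` and the default settings, the helper stops at epoch 4, the first epoch that brings no improvement.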
In the online phase, the real-time sequences are captured from the same security sensors and given as input to the RNN models. The prediction of the next measurement for each sensor is made by the pre-trained RNN models, dynamically. If the mean squared error (MSE) between the predicted value and the real time sensor measurement is consistently higher than a threshold, the anomaly flag is set. The online phase is the actual evaluation of FortuneTeller in a real world system.

Two separate detection models are trained with LSTM and GRU networks, since they are known for their extraordinary capabilities in learning long time sequences. Our purpose is to train an RNN-based detection model, which can predict the microarchitectural events in the next time steps with minimal error. In our detection scenario, we consider a time series $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, where each measurement $x^{(t)}$ is an $m$-dimensional vector $\{x_1^{(t)}, x_2^{(t)}, \ldots, x_m^{(t)}\}$ and each element corresponds to a sensor value at time $t$. As all temporal relations cannot be discovered from millions of samples, a sliding window with a size of $W$ is used to partition the data into small chunks. Thereby, the input to the RNN algorithm at time step $t$ is $\{x^{(t-W+1)}, \ldots, x^{(t-1)}, x^{(t)}\}$, where the output is $y^{(t)} = x^{(t+1)}$. Note that, even though there is a fixed length sliding window in the problem formulation, the overall input size is not fixed. Finally, the trained model is saved to be used in a real-world system.
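The sliding-window partitioning can be made concrete with a short sketch. It assumes the series is a Python list of per-time-step measurement vectors and uses 0-based indices; the function name `sliding_windows` is chosen for this example only.

```python
def sliding_windows(series, W):
    """Build (input window, target) pairs from a multivariate time series.

    series: list of measurements, each a list of m sensor values.
    W:      sliding window size.
    Each input holds the W most recent measurements up to time t and the
    target is the measurement at time t + 1.
    """
    pairs = []
    for t in range(W - 1, len(series) - 1):
        window = series[t - W + 1 : t + 1]
        target = series[t + 1]
        pairs.append((window, target))
    return pairs
```

A series of n measurements therefore yields n - W training pairs, each pairing W consecutive measurements with the one that follows them.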
To evaluate the trained model, a test dataset is collected from benign applications and attack executions. The test dataset has the same structure as the training data, and is fed into the model to calculate the prediction error in the next time steps. The error at time step $t + 1$ is $e^{(t+1)}$, which is equal to $e^{(t+1)} = \sum_{i=1}^{m} \left(\hat{y}_i^{(t)} - x_i^{(t+1)}\right)^2$, where $\hat{y}^{(t)}$ denotes the model's prediction of $x^{(t+1)}$. The model predicts the value of the next measurement and then the error for each sensor is summed up to one value.
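A possible way to compute the per-time-step prediction error is sketched below. The text refers both to a mean squared error and to an error that is summed up to one value, so the sketch sums the squared per-sensor differences by default and offers the mean as an option; which of the two normalizations the authors used is an assumption here, and the function name is hypothetical.

```python
def prediction_error(predicted, actual, mean=False):
    """Squared prediction error over the m sensors of one time step.

    predicted: model output for the next measurement (m values).
    actual:    measured sensor values at that time step (m values).
    mean:      if True, divide by m (MSE); otherwise return the plain sum.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return total / len(actual) if mean else total
```

Since the two variants differ only by the constant factor 1/m, the choice changes the scale of the anomaly threshold but not which time steps stand out.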
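To detect the anomalies in the system, a decision window $D$ and an anomaly threshold $\tau_A$ are used. If all the prediction errors in $D$ are higher than $\tau_A$, then the anomaly flag $F_A$ is set in the system, as given in Equation 1:
$$F_A = \begin{cases} 1, & e^{(t)} > \tau_A \;\; \forall t \in D \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$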
The choice of D directly determines the anomaly detection time. If D is chosen as a small value, the attacks are discovered with a very small leakage. On the other hand, the false alarm risk increases in parallel, which is controlled by adjusting τ A . This trade-off is discussed further in Section V.
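The decision rule can be sketched as follows, assuming that the decision window covers the $D$ most recent consecutive prediction errors and that the flag is raised only when every one of them exceeds $\tau_A$. The function name and the streaming formulation are choices made for this illustration.

```python
def anomaly_flags(errors, D, tau_A):
    """Return one boolean per time step.

    A flag is True when the D most recent prediction errors, including the
    current one, are all higher than the anomaly threshold tau_A.
    """
    flags = []
    run = 0  # length of the current streak of errors above tau_A
    for e in errors:
        run = run + 1 if e > tau_A else 0
        flags.append(run >= D)
    return flags
```

A single error below the threshold resets the streak, which is why a larger $D$ suppresses isolated spikes from benign applications at the cost of a later detection.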
B. Implementation
1) Profiled Benchmarks and Attacks: The main purpose of FortuneTeller is to train a generic model by profiling a diverse set of benign applications. Therefore, selecting benign applications is of utmost importance. For the benign application dataset, benchmark tests in the Phoronix benchmark suite [1] are profiled, since the suite includes different types of applications such as cryptographic implementations, floating point and compression tests, web-server workloads etc. The complete list is given in Appendix, Table V. It is important to note that some benchmark tests have multiple sub-tests and all the sub-tests are included in both training and test phases. In addition to CPU benchmarks, we evaluate our detection models against system, disk and memory test benchmarks. In order to increase the diversity, daily applications such as web browsing, video rendering, Apache server, MySQL database and Office applications with several parameters are profiled for real-world examples.
A subset of the benign execution data is used to train our RNN models and then the whole benign dataset is used to calculate the FPR (False Positive Rate) and TNR (True Negative Rate) of the models. In our work, FPs represent the benign applications which are classified as an attack/anomaly by the RNN model. If a benign application does not raise the alarm flag, it is considered a TN.
For the attack executions we include traditional cache attack techniques such as F+F, F+R and P+P attacks. Different from previous works, these attacks are applied on arbitrary memory blocks to avoid any assumption on the target implementation. Spectre (v1 and v2) and Meltdown are implemented to read secrets such as passwords in a pre-determined memory location. In addition, two types of Rowhammer attacks namely, one-sided and double-sided, are applied to have bit flips. In order to test the efficiency of FortuneTeller we implemented a recent microarchitectural attack, Zombieload, to steal data across processes. For this purpose, a victim thread leaks predetermined ASCII characters and the attacker reads the line-fill buffer to recover the secret. If the alarm flag is set during the execution of the attack, it is True Positive (TP). On the other hand, the undetected attack execution is represented by False Negative (FN).
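The evaluation metrics used in the remainder of the paper follow directly from the four outcome counts defined above. The sketch below assumes the usual definitions, with the F-score taken as the harmonic mean of precision and TPR (the F1 score); the paper does not state which F-score variant is used, so that choice and the function names are assumptions.

```python
def _ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def detection_metrics(tp, fn, fp, tn):
    """Compute rates from counts of detected/undetected attacks (tp, fn)
    and flagged/unflagged benign executions (fp, tn)."""
    tpr = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "TPR": tpr,
        "FNR": _ratio(fn, tp + fn),
        "FPR": _ratio(fp, fp + tn),
        "TNR": _ratio(tn, fp + tn),
        "F-score": _ratio(2 * precision * tpr, precision + tpr),
    }
```

For instance, 8 detected and 2 missed attack executions together with 1 flagged and 9 unflagged benign executions give a TPR of 0.8, an FPR of 0.1 and an F-score of 16/19.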
2) Performance Counter Selection: In our detection model, we leverage HPCs as security sensors. Although the number of available counters in a processor is more than 100, it is not feasible to monitor all counters concurrently. In an ideal system, we should be able to collect data from a diverse set of events to be able to train a generic model. However, due to the limited number of concurrently monitored events, we choose the most optimum subset of counters that give us information about common attacks. For this purpose, we perform a study of the best subset of performance counters.
In our experiments, we leverage Intel PCM tool [25] to capture the system-wide traces. The set of counters in our experiments is chosen from core counters. The main reason to choose core counters is the high variety of the available counters such as branches, cache, TLB, etc. The number of core counters tested in the selection method is 36. The complete list is given in Appendix, Table IV .
In the data collection step, a subset of the counters is profiled concurrently, since the number of counters monitored in parallel is limited to four in Intel processors. For each subset, a separate dataset is collected until all 36 counters are covered. The training data is collected from 30 different Phoronix benchmark tests [1] (1-30 in Table V). In order to decide on the most suitable counters to detect microarchitectural anomalies, we collect a test dataset from 20 benchmark tests (1-56 in Table V) and 6 microarchitectural attacks (174-179 in Table V). The Zombieload attack is not included in the performance counter selection phase, since it was not released at that time. The sampling rate is chosen as 1 ms to have minimal overhead in the system.
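Because only four counters can be monitored at once, the 36 candidate counters have to be covered in 36 / 4 = 9 separate profiling rounds. A minimal sketch of this partitioning is shown below; the placeholder counter names and the function name are hypothetical, and the actual grouping used in the experiments is not specified in the text.

```python
def counter_subsets(counters, max_concurrent=4):
    """Split the candidate counters into groups that fit the hardware limit."""
    return [
        counters[i : i + max_concurrent]
        for i in range(0, len(counters), max_concurrent)
    ]


# Placeholder names standing in for the 36 core counters of Table IV.
candidates = [f"counter_{i}" for i in range(36)]
rounds = counter_subsets(candidates)
```

Each of the resulting groups corresponds to one data collection run and, in the selection procedure, to one separately trained model.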
For every subset of counters, one LSTM model is trained with a window size of W = 100. The four-dimensional data is given as an input to the LSTM model and then the final counters are selected based on their F-score given in Appendix, Table IV. It is observed that some counters have better accuracy than other counters for specific attacks. For instance, branch-related counters have a high correlation with Meltdown and Spectre attacks. However, their F-score remains around 0.3, because real-world applications also use branches heavily. One of the important outcomes of the selection phase is that speculative branches are commonly integrated in benign applications. Therefore, the counter selection shows that branch counters are not useful to detect speculative execution attacks. Thanks to our LSTM-based counter selection technique, finding the most valuable counters is fully automated and the success rate of detecting anomalies with low FPR and FNR is increased significantly.
Even though it is allowed to choose up to 4 counters on Intel server systems like Xeon, we selected 3 counters to profile for anomaly detection. The reason behind this is that in the desktop processors (Intel Core i5, i7) the programmable counters are limited to 3. The first selected counter is L1 Inst Miss, which is more successful at detecting Rowhammer, Spectre and Meltdown attacks with 14% FPR, where the F-score is 0.7979. As a second counter, L1 Inst Hit is chosen, since Flush+Flush and Flush+Reload attacks are detected with a high accuracy and the F-score is 0.8137. The reason behind the high F-score is that the flush instruction is heavily used in those attacks and the instruction cache usage also increases in parallel. Interestingly, although the Flush+Flush attack is known as a stealthy microarchitectural attack, it is possible to detect it by monitoring the instruction cache hit counter. The last selected counter is LLC Miss, which is successful at detecting Rowhammer and Prime+Probe attacks with a high accuracy. These attacks cause frequent cache evictions in the LLC, which increases the number of anomalies in the LLC Miss counter. These results show that the individual counters are not sufficient to detect all the microarchitectural attacks. Therefore, the aforementioned 3 counters need to be combined to detect all the attacks with a high confidence rate.
V. EVALUATION
In this section, we explain the experiments which are conducted to evaluate 
A. Experiment Setup
FortuneTeller is tested on two separate systems. The first system runs on an Intel Xeon E5-2640v3, which is a common processor used on server machines. It has 8 cores with 2.6 GHz base frequency and 20 MB LLC. The second device is used to illustrate a typical laptop/desktop machine, which is based on Intel(R) Core(TM) i7-8650U CPU with 1.90 GHz frequency. It has 8MB LLC and 2 cores in total.
Two types of RNN model namely, LSTM and GRU, are used to train FortuneTeller. The sliding window size, batch size and number of hidden LSTM/GRU layers are kept same in the training phase. Training of RNN models is done using the custom Keras [12] scripts together with the Tensorflow [2] and GPU backend. The models are trained on a workstation with two Nvidia 1080Ti (Pascal) GPUs, a 20-core Intel i7-7900X CPU, and 64 GB of RAM.
B. RNN Model Training
To detect the anomalies in the system, the first step is to learn the pattern of the benign applications. This is not an easy task, since the chosen benchmarks and real-world applications have a complicated fingerprint at the microarchitectural level in the presence of system noise. Moreover, the fingerprint at each execution is not identical and the execution of the application takes several seconds, which makes it difficult to learn long-term relations in the data. Therefore, the required number of measurements from each individual application plays a crucial role in training FortuneTeller successfully. For this purpose, we choose 10 random benchmarks and a separate model is trained for each of them. The validation error obtained as a result of training is the critical metric to determine the capacity of the RNN algorithms, as it indicates how well FortuneTeller guesses the next counter value. The first RNN model is trained with only 1 measurement and the number of measurements is increased gradually up to 44. As shown in Figure 4, there is no further improvement in the validation error after 36 measurements for both LSTM and GRU networks. Note that the training data is scaled to [0, 1] and the validation error is the average error of the 3 counters.
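The scaling of the training data to [0, 1] can be illustrated with per-counter min-max scaling. The paper does not say which scaling method is used, so min-max scaling, the function names and the handling of constant columns are assumptions of this sketch.

```python
def fit_min_max(samples):
    """Return per-counter minima and maxima from the training samples."""
    columns = list(zip(*samples))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(samples, lows, highs):
    """Map every counter value to [0, 1] using the fitted ranges.

    A counter whose fitted range is empty (constant value) is mapped to 0.0.
    """
    scaled = []
    for row in samples:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled
```

In such a scheme the ranges would be fitted once on the benign training data and reused unchanged in the online phase, so that values observed during an attack may fall outside [0, 1].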
Fig. 5. Prediction error in Gnupg for LSTM algorithm
Figure 5 shows the prediction of the ICache.Hit counter value by the LSTM network. The solid line represents the actual counter value, while the two other lines show the predicted values. When there is only one measurement to train the LSTM network, the prediction error is much higher, which means that the model could not optimize its weights with such a small amount of data. On the other hand, once the number of measurements is increased to 36, the predictions are more consistent and close to the actual counter value. The number of measurements also directly affects the training time of the model: the training time increases linearly with the size of the dataset, so an unnecessarily large dataset should be avoided. Therefore, we collect 36 measurements from each application in the training phase to achieve the best outcome from the RNN algorithms in real systems. With accurate modeling of the benign behavior, the number of false alarms is reduced significantly. This is the main advantage of FortuneTeller, since previous detection mechanisms apply a simple threshold technique and flag an anomaly whenever a counter value exceeds the threshold. In contrast, FortuneTeller can predict sudden increases in the counter values, so the correct classification can be made more efficiently than before.
C. Server Environment
The first set of experiments is conducted on the server machine. Three core counters, ICache.Miss, ICache.Hit and LLC.Miss, are monitored concurrently in the data collection process. The training dataset is collected with a sampling rate of 1 ms from m = 3 core counters during the execution of benign applications. The dataset has 10 million samples (10,000 seconds) in total, collected from 67 randomly selected benchmark tests, 100 websites rendered in Google Chrome, the Apache server/client benchmark, the MySQL database and Office applications, as listed in Appendix, Table V. Note that the idle time frames between the executions are excluded from the dataset to avoid redundant information in the training phase. First, the LSTM model is trained with the collected dataset, where the input size is 3 × 10,000,000. The sliding window size is selected as W = 100, which means that the total number of LSTM units equals 100. Further details of the window size analysis are given in Section V-E. The training is stopped after 10 epochs, since the validation error, which decreases to 0.0015, does not improve further. Training on the 10,000,000 samples takes approximately 4 days.
After the LSTM model is trained, a new dataset for the test phase is collected from the counters by profiling 173 benign benchmark tests, 100 random websites, MySQL, Apache, Office applications and microarchitectural attacks. The length of the test data varies between applications, since our anomaly-based model makes no assumption on the input length. Hence, the number of samples obtained from each application ranges between 1000 and 20000. The number of samples for websites is around 1000, since the rendering process is extremely fast. However, some benchmarks have a longer execution time, which requires collecting data for a longer period. The remaining applications (Office, MySQL, Apache) are profiled for around 5 seconds.
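As an illustration of how a sliding window turns the counter trace into training pairs, the following minimal sketch builds (window, next sample) pairs. It assumes a stride of one sample and a single-step prediction target; the helper name `make_windows` and the toy trace are hypothetical, and W = 5 is used only to keep the example small (the experiments use W = 100).

```python
def make_windows(samples, window):
    """Return (inputs, targets).

    Each input holds `window` consecutive samples, and the matching target
    is the sample that immediately follows that window.
    """
    inputs, targets = [], []
    for start in range(len(samples) - window):
        inputs.append(samples[start:start + window])
        targets.append(samples[start + window])
    return inputs, targets


# Each sample is one 1 ms reading of the m = 3 counters (toy values).
trace = [(t, 2 * t, 3 * t) for t in range(8)]
inputs, targets = make_windows(trace, window=5)
```

With N samples and window size W, this produces N - W training pairs, each with an input of shape W × 3.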
Each application is monitored 50 times, and the test data is then fed into the LSTM model to predict the counter values at the next time steps. Moreover, to make the test phase more realistic, the number of applications running concurrently is increased up to 5. The applications are chosen randomly from the test list in Appendix, Table V, and started at the same time. 100 measurements are collected from the concurrently running applications. In total, 25,000,000 samples are collected for the test phase.
The prediction is made for all three counters at each time step (every 1 ms), and then the mean squared error e is computed between the actual and predicted counter values. The error e(t+1) in the prediction step is used to choose the optimal decision window D and threshold τ_A to detect anomalies in the system. If the prediction error is higher than the threshold τ_A for D samples, the application is classified as an anomaly. The threshold and decision window are chosen so as to equalize FNR and FPR. The relation between τ_A and D is given in Figure 6. For lower τ_A values, the decision window cannot be used to detect the anomalies, since both the benign applications and the attack executions produce errors higher than the threshold. Once τ_A reaches 1.8 × 10^6, most of the attack executions are detected within D = 50 samples. In other words, the microarchitectural attacks are caught in 50 ms by FortuneTeller. With increasing τ_A values, the number of true positives begins to decrease, which leads to a low detection rate.
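The decision rule described above can be sketched as follows. This is a minimal illustration that assumes the D samples must be consecutive, which the text does not state explicitly; the function name `detect_anomaly` and the toy error values are hypothetical.

```python
def detect_anomaly(errors, tau_a, d):
    """Return the sample index at which the alarm is raised, or None.

    The alarm fires once the prediction error has stayed above `tau_a`
    for `d` consecutive samples (consecutiveness is an assumption).
    """
    run = 0
    for i, e in enumerate(errors):
        run = run + 1 if e > tau_a else 0
        if run >= d:
            return i
    return None


# Toy prediction errors: a short spike vs. a sustained deviation.
benign_errors = [0.1, 0.9, 0.2, 0.9, 0.1]
attack_errors = [0.1, 0.9, 0.2, 0.8, 0.9, 0.7, 0.1]

benign_alarm = detect_anomaly(benign_errors, tau_a=0.5, d=3)
attack_alarm = detect_anomaly(attack_errors, tau_a=0.5, d=3)
```

Requiring a sustained run rather than a single exceedance is what lets isolated spikes in benign workloads pass without an alarm, at the cost of a detection delay of at least D sampling intervals.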
The results show that the P+P attack is the most difficult attack for the LSTM model to detect on the server. This result is expected, since the P+P attack mostly focuses on specific cache sets and its cache miss ratio is smaller than that of the other attack types. In addition, the instruction cache is not heavily used by the P+P attack, which makes detection more difficult for FortuneTeller due to the lack of a specific pattern. On the other hand, the highest TPR is obtained for the Flush+Reload and Rowhammer attacks, with 100% TPR and 0% FNR. As these attacks increase the number of data cache misses and instruction hits through the extensive use of the clflush instruction, the fluctuation in the counter values is higher than for the other attack types and benign applications. The accuracy of predicting the next values decreases when the variance of the counters is high; thus, the prediction error increases accordingly.
Since a higher prediction error is a strong indicator of attack executions in the system, FortuneTeller detects them with high accuracy. Note that Zombieload is also detectable by FortuneTeller, even though it was not included in the performance counter selection phase. This shows that FortuneTeller can detect unseen microarchitectural attacks with the currently trained models. The ROC curves in Figure 7 indicate that LSTM networks are better than GRU networks at detecting the anomalies. The counter values are predicted with a higher error rate by GRU networks, which makes anomaly detection harder. Some benign applications are always detected as anomalies by GRU; thus, its FPR is high for all threshold values. The AUC (Area Under the Curve) for the LSTM model is 0.9840, which is very close to a perfect classifier. In contrast, the AUC for the GRU model is 0.9125, which is significantly worse than the LSTM model. There are several reasons behind the poor performance of GRU networks. The first is that GRU networks fail to learn the patterns of Apache server applications, since there is a high fluctuation in the counter values. In addition, when the number of concurrently running applications increases, the false alarms increase drastically. LSTM networks, on the other hand, are good at predicting the combination of patterns in the system. Therefore, the FPR is very small for the LSTM model.
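For readers unfamiliar with how a ROC curve and its AUC are obtained from anomaly scores, the following minimal sketch sweeps a threshold over per-execution scores and integrates the curve with the trapezoidal rule. The scores are toy values, and the paper does not state which tool or procedure produced Figure 7, so this is only one common way to compute the metric.

```python
def roc_points(benign_scores, attack_scores):
    """Sweep the threshold over all observed scores; return (FPR, TPR) points."""
    thresholds = sorted(set(benign_scores) | set(attack_scores), reverse=True)
    points = [(0.0, 0.0)]
    for th in thresholds:
        fpr = sum(s >= th for s in benign_scores) / len(benign_scores)
        tpr = sum(s >= th for s in attack_scores) / len(attack_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Trapezoidal area under a ROC curve given as (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy anomaly scores (for example, mean prediction error per execution).
benign = [0.1, 0.2, 0.3, 0.6]
attack = [0.5, 0.7, 0.8, 0.9]
area = auc(roc_points(benign, attack))
```

An AUC of 1.0 corresponds to scores that separate benign and attack executions perfectly, while 0.5 corresponds to random guessing.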
D. Laptop Environment
The experiments are repeated in the laptop environment to evaluate the usage of FortuneTeller. LSTM and GRU models are trained with 10 million samples, which are collected from benign applications. Since laptops are mostly used for daily work, the counter values are relatively smaller than in the server scenario. However, the applications stress the system more than in the server scenario, since the number of cores is lower. When we analyze the relation between D and τ_A, we observe the same situation as in the server scenario. The lower τ_A values are not sufficient to distinguish the anomalies from benign executions. Therefore, we need to choose an optimal τ_A value of 3.8 × 10^6, which is slightly higher than in the server scenario. The corresponding D value is 60, which means that the anomalies are detected in 60 ms. Although the decision window is 10 ms longer than in the server scenario, the performance of FortuneTeller is better in the laptop scenario. In Figure 9, the ROC curves of the LSTM and GRU models are compared. The AUC of the LSTM model is 0.9865, which is considerably higher than the AUC of the GRU model, 0.8925. This shows that LSTM outperforms GRU in predicting the counter values of benign applications, and that FNR and FPR are lower for LSTM models.
Among the attack executions, the Rowhammer attack can be detected with a 100% success rate, since its prediction error is very high. The other attacks have similar prediction errors; hence, FortuneTeller detects them with the same success rate. Since the computational power of laptop devices is low, concurrently running applications introduce more noise into the counter values. Therefore, the prediction of the counter
The overall results in Table II show that LSTM works better than GRU networks for both laptop and server scenarios. The first and second values represent the LSTM and GRU false alarm rates per second in percentages, respectively. In the server scenario, videos, MySQL and Office applications never give false positives. Websites running in Google Chrome cause a small number of false alarms. Overall, the FPR and FNR are around 0.12% per second for the LSTM network in the server scenario. The main disadvantage of GRU networks is their poor prediction performance when the number of applications increases. Their FPR and FNR are approximately 0.24%, which means that the number of false alarms is twice as high for a GRU-based FortuneTeller.
In the laptop scenario, LSTM also performs better, which is supported by the false alarm rate. At 0.09%, the false alarm rate on laptop devices is lower than in the server scenario. On the other hand, GRU networks lack the ability to predict the counter values, which is also reflected in the false alarm rates. For every application, GRU has a higher false alarm rate than LSTM networks. Therefore, we conclude that FortuneTeller should be trained with LSTM networks to obtain better performance in both server and laptop scenarios.
E. Varying size of Sliding Window
We observed that the prediction results are affected by the size of the window. Therefore, we analyze the effect of the sliding window size on anomaly detection with data collected from core counters at a 1 ms sampling rate in the server environment. 12 different window sizes are used to train LSTM and GRU models. The window size starts at 25 and is increased by 25 at each step until it reaches 300.
The changes in the validation error for both LSTM and GRU networks are depicted in Figure 10. The overall GRU training error is higher than that of the LSTM network for each window size. Both models reach the lowest error when the sliding window size is 100. Even though LSTM and GRU are designed to learn long sequences, we recommend choosing a window size between 50 and 150. Since the best error is obtained with a window size of 100, all the models in the previous experiments are trained with this parameter. It is also important to note that the training time increases proportionally to the size of the window.
F. Time consumption for Testing
The dynamic detection of anomalies also depends on the time needed to predict the next counter values. Therefore, the sampling rate should be chosen as close as possible to the time needed to predict the next value. In our experiments, we observed that the prediction time is proportional to the size of the model. Since GRU has fewer cells in its architecture, its prediction is faster. While LSTM outputs prediction values for 3 counters in 2 ms, the prediction time for GRU is 1.7 ms, which means that GRU is 15% faster than LSTM in the prediction phase. However, due to the high FPR of GRU networks, FortuneTeller is trained with LSTM networks to detect anomalies in the real-time system.
G. Performance Overhead
The performance overhead of the proposed countermeasures is one of the most important concerns, since it affects all the applications running on the system. In this section, we evaluate the performance overhead for both server and laptop devices when core counters are used to collect data. The overhead is measured with sampling rates of 1 ms and 10 µs. As expected, the performance overhead increases with the sampling rate. In the server environment, the overhead is around 7.7% when the sampling rate is chosen as 10 µs. The overhead of individual tests fluctuates between 1% and 33% for benign applications. The performance of system and memory benchmarks is affected more than that of processor-based benchmarks. On the other hand, when the sampling rate is decreased to 1 ms, the performance overhead drops to 3.5%. The individual overheads range between 0.3% and 18%, which is more stable than in the previous case.
In the laptop scenario, the performance overhead is calculated with the same benign benchmarks. The number of cores is smaller than in the server scenario. Therefore, the performance monitoring unit only needs to read the counter values from 4 threads in parallel. On the other hand, since the system has lower specifications than the server machine, the overhead grows when the sampling rate is increased. The overall performance degradation is 24.88% for 10 µs. The overhead fluctuates heavily, which means that the applications suffer from the frequent interruptions needed to read the counter values. Once the sampling rate is decreased to 1 ms, the overhead drops to 1.6%, which is acceptable in real-time systems. This overhead is also lower than in the server scenario. Therefore, we preferred a 1 ms sampling rate in our experiments.
VI. COMPARISON OF FortuneTeller WITH PRIOR DETECTION METHODS
There are several studies focused on microarchitectural attack detection, as given in Table I. While some works [7], [10], [67] use unsupervised techniques, Mushtaq et al. [41] benefit from supervised ML methods. All proposed methods claim that the false positive rate is very low in real-world scenarios. However, all these detection techniques are only applied to cryptographic implementations (AES, RSA, ECDSA, etc.) and specific cache attacks (F+R, F+F, P+P). The performance of these techniques in real-world scenarios (noisy environment, multiple concurrent processes) against transient execution attacks (Meltdown, Spectre, Zombieload, etc.) and Rowhammer is questionable. In order to evaluate the 4 proposed methods and FortuneTeller, we collected 6 million samples (4 million from benign executions, 2 million from attack executions) with a 1 ms sampling rate from 10 benign processes and 7 microarchitectural attacks by using system-wide counters. Note that each benign and attack execution is monitored 100 times in the server environment. The benign processes are chosen from a diverse set of applications such as Apache, MySQL, a browser and cryptographic implementations. The attacks cover the cache-based, transient execution and Rowhammer attacks given in Table V. The detection algorithms from previous works are reimplemented in the Matlab environment and tested with the collected data.
CPD from Briongos et al. [7]
The first approach is Change Point Detection (CPD), which was implemented by Briongos et al. [7] to detect anomalies in the victim process. The primary advantage of the method is its capability of self-learning by observing the number of cache misses.
On the other hand, the assumption of having almost no LLC misses is a strong one, which is not applicable in real-world scenarios for system-wide profiling. In particular, when an application runs for the first time in the system, the number of cache misses increases drastically. This leads to a high number of false positives at the beginning of the applications. Even though we tried to eliminate the initial false positives by increasing the initial value of cache misses under attack (µ_a), we still observe several false positives at the beginning. It is also difficult to monitor each PID in the system, since there are hundreds of processes running at the same time.
For the evaluation of the CPD method, we use the initial values µ_a = 100 and β = 0.65. When the CPD method is applied to our dataset, we observe that the FPR is 3% and the FNR is 10%. Therefore, the F-score is 0.9372. However, with an increasing number of concurrent processes, the false positive rate increases. This shows that the CPD method is efficient under low system load but gives more false positives with increasing workload. The estimated detection time is around 300 ms for attack executions. The detection performance for Rowhammer and P+P attacks is poor, since their number of cache misses is not high compared to benign processes. Therefore, these two attack types increase the FNR.
DTW from Zhang et al. [67]
The second detection method was proposed by Zhang et al. [67]. It uses Dynamic Time Warping (DTW) to detect the cryptographic implementations, and then monitors the LLC hit and miss counters to detect the attacks. In the first step, DTW is used to compare the test data with the signature of the cryptographic implementations obtained from branch instructions. In the second step, when the distance between the test and target execution is very small, the LLC hit and LLC miss counters are monitored. If there is a sudden jump in these two counters, the anomaly flag is set. Again, this approach requires the PID of the monitored process.
In the evaluation of the method, we started with the application detection. Since the number of target applications is small in our dataset, DTW can detect them with a 100% success rate in a noiseless environment. However, when there is a concurrent process running in the system, the DTW distance is always high. The reason behind this failure is that branch instructions are heavily affected by the other processes. Therefore, DTW is not suitable for real-world scenarios. Another drawback is that if the microarchitectural attack has already started, the branch instructions are also affected, which prevents the detection of the target application. Hence, the attack detection step never starts. When the target process is detected, the anomaly detection step begins. If another concurrent workload starts running at the same time, the cache miss and hit counters start increasing, which increases the FPR substantially. Since there is only a simple threshold approach to detect the attacks and the proposed decision window (5 ms) is too small, the FPR rises. In these circumstances, the approach achieves a 10% FPR. The attack detection is also weak, since it is not possible to detect the F+F attack with cache miss and hit counters. Thus, the FNR increases as well and reaches 20%. Overall, the detection technique has an F-score of 0.8572.
PDF from Chiappetta et al. [10]
In the third study, we evaluate the performance of the normal distribution and probability density function approach proposed by Chiappetta et al. [10]. The detection technique monitors five counters (total instructions, CPU cycles, L2 hits, L3 misses and L3 hits) to catch anomalies in the cryptographic implementations. This technique is used in an unsupervised manner by learning only the normal distribution of the attack execution (F+R), with its mean and variance, in the system. After the normal distribution is calculated, the probability density function (pdf) of both attack and benign executions is calculated for each counter sample. Then, an optimal threshold is chosen to separate the benign and attack processes. To evaluate the performance of the method, we collected a separate dataset with the aforementioned five counters. The results indicated that total instructions and L2 hits decrease the performance of detecting anomalies. On the other hand, the L3 hit and miss counters outperform the other counters. The main drawback of this method is that there is no learning and the decisions are based only on cache miss and variance values. Therefore, when there is a benign application with high variance and mean, it is more likely to be classified as an anomaly. In particular, the Apache server benchmark and videos running in browsers give a high FPR. It is also observed that the P+P, F+F and Rowhammer attacks are not detected with high accuracy. This results in an FPR of 0.2145 and an FNR of 0.3732. The F-score of the detection technique is 0.7278.
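The idea behind this baseline can be sketched for a single counter as follows. This is a simplified illustration, not the implementation of [10] or of our Matlab reimplementation: it assumes one counter instead of five, assumes that a sample is flagged as attack-like when its density under the attack model is at least the threshold, and uses hypothetical names and toy values throughout.

```python
import math


def fit_gaussian(samples):
    """Estimate mean and (population) variance of a counter under attack."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def is_attack_like(x, mean, var, threshold):
    """Assumed rule: high density under the attack model means attack-like."""
    return gaussian_pdf(x, mean, var) >= threshold


# Toy L3-miss readings observed during an attack (hypothetical values).
attack_misses = [900, 1000, 1100, 950, 1050]
mean, var = fit_gaussian(attack_misses)

flag_attack = is_attack_like(1000, mean, var, threshold=1e-4)
flag_benign = is_attack_like(200, mean, var, threshold=1e-4)
```

Because the model only captures the mean and variance of one distribution, any benign workload whose counter values happen to fall near the attack mean is flagged, which is consistent with the high FPR observed for Apache and video workloads.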
OC-SVM from Mushtaq et al. [41]
The last method in the comparison is One Class Support Vector Machine (OC-SVM), which is used by Mushtaq et al. [41] to detect anomalies in cryptographic implementations. The scope is limited to F+F and F+R attacks. The number of counters tested in [41] is higher than three, which makes it impossible to monitor all of them concurrently. Therefore, we chose the three counters (L1 miss, L3 hit and L3 total cache access) that give the highest F-score. Even though OC-SVM was used in a supervised way in [41], we used it in an unsupervised manner to maintain consistency in the comparison. In the training phase, the model is trained with 50% of the benign execution data. Then, the attack and benign datasets are tested with the trained model. The obtained confidence scores are used to find the optimal decision boundary that separates the benign and attack executions. With the optimal decision boundary, the FPR and FNR are 0.2750 and 0.2778, respectively. The main problem is that OC-SVM is not sufficient to learn the diverse benign applications, which increases the FPR drastically. Moreover, Rowhammer and F+F attacks are not detected, which is the reason for the higher FNR. Therefore, the F-score remains at 0.7240.
FortuneTeller
Finally, we apply FortuneTeller to detect the anomalies in the system. Since the diversity of the benign executions is smaller in the comparison dataset, it is easier to learn the patterns. It is also important to note that 50 measurements from each benign application are enough to reach the minimum prediction error. Once the LSTM model is trained with the benign applications, the attack executions and the remaining benign application data are tested. The FPR and FNR remain at 0.2% and 0.4%, respectively. The F-score of FortuneTeller is 0.997. The comparison results are summarized in Table III. The results show that the lack of appropriate learning has a significant effect in the wild. It is also evident that even a simple learning algorithm such as CPD helps to outperform the other prior detection techniques. We also show that the detection accuracy increases by learning the sequential patterns of benign applications with system-wide profiling. Therefore, it is important to extract fine-grained information from the hardware counters to achieve low FPR and FNR. The common deficiencies of previous works are listed below:
• The detection methods focus on only cryptographic implementations, and the latest attacks such as Rowhammer, Spectre, Meltdown and Zombieload are not covered.
• There is no advanced learning technique applied in the detection methods. They mostly rely on the sudden changes in the counters, which increases the FPR heavily.
• The detection methods are either tested in a noise-free environment or with an unrealistic workload. In addition, the FPR is not tested with a diverse set of applications.
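The text does not state how the F-scores in this section are derived from the FPR and FNR values. As a consistency check, the sketch below uses one convention that reproduces all five reported values to within rounding: the F1 score with the benign class treated as the positive class and both classes weighted equally. This convention is an inference from the reported numbers, not a statement of the evaluation procedure, and the function name is hypothetical.

```python
def f_score(fpr, fnr):
    """F1 with 'benign' as the positive class and equal class weights.

    This convention is an assumption chosen because it matches the
    reported values; it is not stated in the text.
    """
    tp = 1.0 - fpr  # benign executions correctly classified as benign
    fn = fpr        # benign executions flagged as attacks
    fp = fnr        # attack executions missed, i.e. classified as benign
    return 2 * tp / (2 * tp + fp + fn)


# (FPR, FNR, reported F-score) for each method discussed in this section.
reported = {
    "CPD": (0.03, 0.10, 0.9372),
    "DTW": (0.10, 0.20, 0.8572),
    "PDF": (0.2145, 0.3732, 0.7278),
    "OC-SVM": (0.2750, 0.2778, 0.7240),
    "FortuneTeller": (0.002, 0.004, 0.997),
}

for name, (fpr, fnr, f_reported) in reported.items():
    print(f"{name}: computed {f_score(fpr, fnr):.4f}, reported {f_reported}")
```

Under this convention, a method's F-score drops when it either flags benign executions or misses attacks, so a single number summarizes both error types listed in Table III.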
VII. DISCUSSION
Bypassing FortuneTeller
One question about dynamic detection methods is how an educated adversary can bypass the detection model. A common way is to insert delays between the attack steps to avoid increasing the counter values. To test this, we inserted different amounts of idle time between the attack steps in Flush+Flush, Prime+Probe and Flush+Reload. We observed that the prediction errors of the GRU and LSTM networks increase with the amount of sleep, because the resulting high fluctuation in the time series data is not predicted well in the prediction phase. This shows that introducing delays between attack steps is not an efficient way to circumvent FortuneTeller. On the other hand, crafting adversarial examples is an efficient way to bypass deep learning based detection methods. For instance, Rosenberg et al. [45] show that LSTM/GRU based malware detection techniques can be bypassed by carefully inserting additional API calls. Therefore, crafting adversarial code snippets that change the performance counters in the attack code may fool FortuneTeller. The main difficulty in this approach is that it is not possible to decrease the counter values by executing more instructions between attack steps. Therefore, applying adversarial examples to hardware counter values is not trivial.
Training Algorithm
FortuneTeller investigates both available long-term dependency learning techniques. We observed that GRU performs worse than LSTM networks in predicting the counter values at the next time steps. This is because GRU lacks the internal memory state, which keeps the relevant information from previous cells. This result is also supported by the high FPR and FNR of GRU networks.
Since the prediction error increases for attack executions more than for benign applications, the detection accuracy decreases. Therefore, we recommend training LSTM networks for microarchitectural attack detection techniques.
Dynamic Detection
The current implementation requires a GPU to train FortuneTeller, as GPU-based training is 40 times faster than CPU-based training. The training is mostly done in an offline phase and does not affect the dynamic detection. On the other hand, dynamic detection heavily depends on matrix multiplication, since the trained model is loaded as a matrix in the system and this matrix is multiplied with the current counter values. Hence, the required time to predict the next counter values is lower. In addition, we observed that the performance overhead of the matrix multiplication is negligible on CPU systems. Therefore, FortuneTeller can be deployed in server, cloud and laptop environments even when there is no GPU integrated in the system.
VIII. CONCLUSION
This study presented FortuneTeller, which exploits the power of neural networks to overcome the limitations of prior works, and further proposes a novel generic model to classify microarchitectural events. FortuneTeller is able to dynamically detect microarchitectural anomalies in the system by learning the benign workload. In our study, we adopted two state-of-the-art RNN models: GRU and LSTM. We concluded that LSTM is preferable to GRU for our use case. Further, the number of measurements and the sliding window size have a significant effect on the validation error in the training phase, which makes it crucial to choose the optimal values to obtain better prediction results. FortuneTeller is applicable to both server and laptop environments with high accuracy. In order to evaluate the performance of FortuneTeller, we used both benchmarks and real-world applications and achieved FPRs of 0.12% and 0.09% for the server and laptop environments, respectively. FortuneTeller is also tested against previous works in realistic scenarios, and we conclude that FortuneTeller outperforms the other detection mechanisms in the wild. While the performance overhead in the laptop environment is lower than in the server environment, FortuneTeller is applicable in real-world systems with minimal overhead in both cases.
IX. APPENDIX
A. Tables for Performance Counters and Benchmarks
