Meltdown is a microarchitectural side-channel attack that extracts sensitive data from the kernel space of operating systems (OSs). Meltdown deliberately creates transient executions by exploiting out-of-order execution and obtains the execution results through a cache covert channel. In the original attack, an OS signal handler and hardware transactional memory support (i.e., Intel TSX) were used to establish the cache covert channel. However, both methods restricted the effectiveness of the attack owing to the large amount of system noise caused by the context switching of signal handlers and the narrow range of TSX-enabled processors. Hence, we propose a new variant of the Meltdown attack using a return stack buffer (RSB). The RSB enables the establishment of a low-noise cache covert channel without relying on processor-specific hardware features, such as TSX. The wide usage of the RSB in commodity processors further improves the effectiveness of the proposed attack. We present the details of our implementation of the attack and evaluate its performance. Furthermore, we give an overview of several existing countermeasures against the proposed attack.
I. INTRODUCTION
For decades, performance improvement has been challenging in microarchitecture design. Consequently, optimization techniques such as out-of-order and speculative executions have been applied to modern processors. However, the design strategy that pursues performance results in serious security vulnerabilities in hardware, thus allowing transient execution attacks [2].
Meltdown [3] is a recently discovered transient execution attack. By leveraging out-of-order execution, Meltdown enables unauthorized access to kernel memory without exploiting bugs in software. The attack is initiated by executing a load instruction with a kernel address. Because of the access violation, a page fault will eventually be raised when the instruction is retired. The out-of-order execution engine, however, will have already permitted the execution of subsequent instructions before the retirement. Although those instructions will be discarded later, their transient execution leaves a footprint on the cache. The footprint is subsequently decoded into a secret and delivered to the attacker through a cache covert channel using the Flush+Reload technique [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Jiafeng Xie.
The procedure of the Meltdown attack comprises two distinct stages: (1) in the first stage, transient execution is triggered; (2) in the second stage, the execution result is delivered through the cache covert channel. To obtain the correct data in the second stage, the cache footprint created in the first stage should remain intact during the attack process. However, the cache is extremely fragile because of the nondeterministic behavior of operating systems. That is, the result of the first stage is highly likely to be corrupted by other concurrent system activities during transition to the second stage. Hence, the most challenging problem for a successful attack is to initiate the second stage as soon as possible after the first stage.
Therefore, two approaches are proposed in the original Meltdown attack: (1) a signal-handler-based method and (2) a TSX-based method. A signal handler is a user-level exception handling mechanism provided by an OS. Any exception (e.g., a page fault) occurring in the processor is detected by the kernel first. The kernel subsequently selects an appropriate signal handler and transfers control to the handler to manage the exception. In the signal-handler-based method, the code for the cache covert channel (i.e., the Flush+Reload code) is located in a signal handler; the code is automatically invoked whenever a page fault is raised. However, context switching from kernel to user-land is necessary to initiate the second stage in this method. Thus, it is infeasible to avoid a significant amount of noise in the cache covert channel. An alternative to the signal-handler-based method is to utilize TSX, a hardware feature for transactional memory support in Intel processors. The TSX-based method exploits the property that any exception raised during a transaction is suppressed and redirected to a (user-mode) abort handler, which eliminates the need for context switching. Owing to its exception suppression, TSX enables a fast and low-noise cache covert channel. However, the TSX-based method is highly restrictive because only a narrow range of TSX-enabled processors is available.
VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
To summarize, the original Meltdown attack suffers from non-negligible problems of both methods: a large amount of system noise from context switching and low usage of the TSX feature. These problems significantly restrict the effectiveness of the Meltdown attack.
We herein propose a new variant of Meltdown that improves the effectiveness of the attack. Our method utilizes a return stack buffer (RSB), a type of branch prediction unit equipped in most commodity x86 processors. The basic concept is to deliberately cause a misprediction on the RSB by modifying a return address to create a transient execution.
After an execution rollback, the control will be immediately transferred to the correct destination (i.e., the return address), where a Flush+Reload code is located. Using an RSB eliminates the need for context switching. Hence, a low-noise cache covert channel can be achieved without relying on any processor-specific hardware features such as TSX. Moreover, the wide usage of RSBs in processors, contrary to TSX, improves the effectiveness of Meltdown-type attacks in practical systems.
Although the concept is intuitive, it is not trivial to implement the proposed attack. The challenging problem is that the transient execution window opened by the RSB is not sufficiently large to execute all the instructions of the Meltdown code. We solve this problem by inserting a number of arithmetic instructions that perform some calculations with a return address. The calculations introduce a long data dependency chain among instructions, which delays the resolution of the return address. Thus, our solution can effectively increase the size of the transient execution window.
The contributions of our study are as follows:
• We propose a more effective Meltdown attack using an RSB.
• We propose a novel solution that enlarges the transient execution window created by the RSB.
• We present an implementation of the proposed attack in detail as well as the performance evaluation results.
• Finally, we overview several countermeasures against the proposed attack.
The remainder of this paper is organized as follows. In Section II, we provide some background knowledge regarding our attack. In Section III, we explain the original Meltdown attack and its limitation. In Section IV, we describe the proposed attack in detail. In Section V, we present an implementation of the proposed attack as well as our evaluation results on the implementation. In Sections VI and VII, we present several countermeasures against our attack and discuss related studies, respectively. Finally, in Section VIII, we conclude this paper by summarizing our work.
II. BACKGROUND
In this section, we present the key features of modern processors and basic knowledge regarding cache covert channels.
A. OUT-OF-ORDER AND SPECULATIVE EXECUTIONS
To maximize performance, instruction-level parallelism (ILP) is adopted as a key principle in designing the microarchitecture of modern processors. ILP is enabled by two optimization techniques: out-of-order and speculative executions.
1) OUT-OF-ORDER EXECUTION
The Intel x86 architecture comprises a complex instruction set. Hence, x86 processors first split individual instructions into smaller micro-operations (i.e., µ-ops) prior to execution. As only actual µ-ops must be implemented in hardware, the microarchitecture design of x86 processors is simplified.
When executing a sequence of µ-ops, the processor may not follow the order of instructions in the program code. Instead, µ-ops are executed out of order as soon as the required resources (e.g., source operands or execution units) become available. The out-of-order execution technique follows Tomasulo's algorithm [5]. In this algorithm, µ-ops are dynamically scheduled using a reservation station; any available resources are notified through the reservation station. After the µ-op execution is completed, the result is temporarily stored in a reorder buffer (ROB). The result will be committed to the architectural state (i.e., registers or memory) only upon the retirement of the corresponding instruction. Regarding instruction retirement, the processor must ensure that µ-ops are committed in order according to the program code. Otherwise, the execution will yield incorrect architectural behavior.
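The interplay between out-of-order completion and in-order retirement can be illustrated with a minimal Python sketch. It is a toy model only: the µ-op names and latencies are hypothetical, all µ-ops are assumed to be independent, and the function `run_out_of_order` is introduced purely for illustration.

```python
from collections import deque

def run_out_of_order(program):
    """Toy model: `program` is a list of (name, latency) pairs in program order.

    All micro-ops are assumed independent, so each one completes after its
    own latency. Completion is out of order; retirement is in program order.
    """
    # Micro-ops complete in order of increasing latency (sorted() is stable).
    completion_order = [name for name, _ in sorted(program, key=lambda op: op[1])]

    rob = deque(name for name, _ in program)  # reorder buffer keeps program order
    done = set()
    retired = []
    for name in completion_order:
        done.add(name)
        # Only the oldest entry may retire, and only once it has completed.
        while rob and rob[0] in done:
            retired.append(rob.popleft())
    return completion_order, retired

# Hypothetical example: a slow load followed by two faster arithmetic micro-ops.
completed, retired = run_out_of_order([("load", 5), ("add", 1), ("mul", 3)])
print(completed)  # completion order differs from program order
print(retired)    # retirement order equals program order
```

In this sketch the two arithmetic µ-ops finish before the load, yet nothing retires until the load at the head of the ROB has completed.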
2) SPECULATIVE EXECUTION
Managing branch instructions is challenging in modern out-of-order processors. As deciding the branch result is time consuming, processors utilize speculative execution as another optimization technique. Speculative execution predicts the outcome of the branch instruction and speculatively jumps to the predicted destination before the resolution. Speculative execution uses special hardware, such as a branch prediction unit (BPU) and a branch target buffer (BTB). A BPU is used to predict the outcome of conditional branch instructions (i.e., taken or not taken), while a BTB is used for indirect branches such as indirect calls or jumps. If the prediction is correct, substantial performance gains will be obtained from the speculative execution. Otherwise, the processor should flush all the instructions in the pipeline and re-execute the instructions at the correct destination.
B. CACHE COVERT CHANNEL
A cache is a hardware component in a processor that narrows the latency gap between registers and main memory. A cache covert channel is an attack technique that extracts confidential data from another party by exploiting cache contention as a leakage source.
The Flush+Reload attack [4], [6] is a technique for building a cache covert channel. It utilizes page sharing between a sender and a receiver, in which some cache lines are shared between them. The attack proceeds in three phases. In the Flush phase, the receiver flushes a cache line shared with the sender from the entire cache. In the x86 architecture, the clflush instruction flushes the cache line. During the Wait phase, the receiver waits for a certain amount of time while the sender performs a write operation. Finally, in the Reload phase, the receiver reloads the flushed line and measures its latency. A lower reload time (i.e., a cache hit) indicates that the line has been accessed by the sender during the Wait phase, which means that the receiver reads "1". On the contrary, a higher reload time (i.e., a cache miss) means that the line was not accessed and still resides in the memory; therefore, the receiver reads "0".
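The three phases can be sketched with a toy cache model in Python. This is a simulation under stated assumptions rather than a real attack: the latencies, the threshold, and the names `ToyCache` and `transmit_bit` are hypothetical, and a real receiver would use clflush and a cycle counter instead of a Python set.

```python
HIT_CYCLES = 40      # assumed latency of a cached access (illustrative)
MISS_CYCLES = 200    # assumed latency of a memory access (illustrative)
THRESHOLD = 100      # assumed cut-off separating hits from misses

class ToyCache:
    """Tracks which line addresses are currently cached."""
    def __init__(self):
        self.lines = set()

    def flush(self, addr):
        self.lines.discard(addr)          # models clflush on the shared line

    def access(self, addr):
        latency = HIT_CYCLES if addr in self.lines else MISS_CYCLES
        self.lines.add(addr)              # any access brings the line into the cache
        return latency

def transmit_bit(cache, shared_line, bit):
    cache.flush(shared_line)              # Flush phase (receiver)
    if bit:                               # Wait phase: sender touches the line to send 1
        cache.access(shared_line)
    latency = cache.access(shared_line)   # Reload phase (receiver times the access)
    return 1 if latency < THRESHOLD else 0

cache = ToyCache()
message = [1, 0, 1, 1, 0]
received = [transmit_bit(cache, 0x1000, b) for b in message]
print(received)
```

On real hardware the threshold has to be calibrated per machine, since hit and miss latencies vary across microarchitectures.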
The Flush+Reload technique is not applicable when page sharing is unavailable or disabled. In this case, we can use the Prime+Probe technique [7] as an alternative to the Flush+Reload technique.
III. MELTDOWN ATTACK AND ITS LIMITATION
In this section, we present the original Meltdown attack [3] in detail and discuss its limitations.
A. MELTDOWN ATTACK
The Meltdown attack relies on the out-of-order execution of modern processors to read unauthorized data in kernel memory. Fig.1 illustrates the overall attack process. As shown in Fig.1(a), the attack code consists of only several x86 instructions. The register %rcx contains a virtual address in the kernel space, where unprivileged access from user space is prohibited. In the load instruction (at line 1), the processor attempts to read the memory at the address specified by %rcx and load its value to %rax. Contrary to intuition, this single instruction invokes a complicated execution process in the microarchitecture. In particular, a series of µ-ops is internally executed to perform the load operation (Fig.1(b)). Prior to memory access, the virtual address must be translated into a physical address, followed by page table entry (PTE) loading and permission verification. For further optimization, the processor continues reading the memory (or cache) immediately after the physical address has been determined while the permission is being verified. Because of the access violation, the permission verification will eventually fail, and a page fault exception will consequently be raised at the instruction retirement.
The insight of the Meltdown attack is that the execution of multiple µ-ops for the load instruction may cause stalls in the pipeline. This allows out-of-order execution of the following instructions (lines 2-3 in Fig.1(a)) to be performed before the retirement of the first instruction. The subsequent instructions create a memory access to another location, whose address is determined from %rax and %rbx. The %rbx register contains an attacker-controlled base address, while %rax contains a single-byte value loaded from the kernel address (i.e., %rcx). Therefore, the memory access will leave a footprint on the cache, whose location depends solely on the value of %rax. The executions of subsequent instructions will never be completed owing to the access violation of the first instruction. However, the affected cache status remains unchanged even after the execution results have been rolled back.
As a distinguishable footprint is produced on the cache with respect to the kernel value (i.e., %rax), the value can be learned through the cache covert channel using the Flush+Reload technique. In the Meltdown attack, two approaches are applicable for receiving the value from the covert channel. In the signal-handler-based approach (Fig.1(c)), the Flush+Reload code runs in the context of a user-mode signal handler. This method originates from the intuitive observation that a page fault triggers the exception handling process of the OS. Another approach is to utilize TSX, the hardware transactional memory support in Intel processors (Fig.1(d)). TSX allows a page fault to be directly handled in a user-mode transaction abort handler. This eliminates error-prone context switching, which helps to achieve a high signal-to-noise ratio in the cache covert channel.
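How a single kernel byte is encoded into, and recovered from, the cache footprint can be sketched in Python as follows. This is a toy model: the set `cached_offsets` stands in for the cache, the 4096-byte stride per byte value is an assumption (the text above only states that the address is derived from %rax and %rbx), and all function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed stride between probe-array slots (one page per byte value)

def transient_access(cached_offsets, secret_byte):
    """Models the dependent load at base + secret_byte * PAGE_SIZE.

    The architectural result is discarded, but the touched line stays cached.
    """
    cached_offsets.add(secret_byte * PAGE_SIZE)

def recover_byte(cached_offsets):
    """Models the Reload phase: probe all 256 slots and report the single hit."""
    hits = [value for value in range(256) if value * PAGE_SIZE in cached_offsets]
    return hits[0] if len(hits) == 1 else None  # None: no hit or ambiguous result

def leak(secret_byte):
    cached_offsets = set()                 # probe array starts fully flushed
    transient_access(cached_offsets, secret_byte)
    return recover_byte(cached_offsets)

print(hex(leak(0x41)))
```

A result of `None` in this sketch corresponds to a failed measurement, in which case a real attack would simply retry.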
B. LIMITATION
The aforementioned methods to receive a kernel value via the cache covert channel are conducted in the context of page fault handlers: in a signal handler or in a TSX abort handler. We discuss some limitations of those approaches.
1) LIMITATION OF SIGNAL-HANDLER-BASED METHOD
This method utilizes an OS mechanism that allows exceptions to be handled by user-mode signal handlers. Specifically, it is implemented by registering a Flush+Reload function as a user-defined signal handler. Hence, the function will be invoked whenever a page fault exception occurs.
Despite its simplicity, however, the internal process of passing exceptions to the signal handler is extremely complicated. As shown in Fig.1(c), a page fault is first passed to an OS-registered default exception handler, which runs in kernel mode. The default handler subsequently seeks an appropriate signal handler to manage the exception and transfers its control to the handler.
As the signal handler is supposed to run in user mode, the execution context must be switched from kernel to user mode, which requires OS intervention. However, the OS does not always relinquish its control to the signal handler immediately after context switching. In certain cases (e.g., when an interrupt occurs), the OS may select another process with higher priority based on its scheduling policy. This implies that a footprint on the cache created by transient executions is more likely to be polluted by other processes and OS execution. In practical environments where hundreds of processes are executing concurrently, the cache covert channel may become unreliable due to the nondeterministic scheduling behavior.
Therefore, the signal-handler-based method suffers from a high amount of system noise that occurs during context switching.
2) LIMITATION OF TSX-BASED METHOD
As an alternative to a signal handler, TSX can be exploited to establish the cache covert channel. TSX is a hardware-based mechanism that supports transactional memory in Intel processors. To enhance software performance, TSX enables data synchronization in a multithreaded execution environment without software locks. In particular, it implements optimistic executions in a transactional code region; it allows transactions to be executed concurrently without synchronization and rolls back the transaction if memory conflicts are detected. Hence, TSX provides an abort handler by which aborted transactions can be handled properly. Interestingly, the TSX abort handler suppresses all CPU exceptions from being trapped to the OS. That is, any exceptions that occur in the transactional code region are passed directly to the abort handler.
By exploiting the suppression, the Meltdown attack can be executed within the transaction. The Flush+Reload function, which executes in the context of a TSX abort handler, will be immediately invoked upon a page fault (see Fig.1(d)). Hence, context switching is eliminated during the process, thus improving the reliability of the cache covert channel significantly.
However, TSX is not available in all Intel processors. In fact, TSX was originally designed to support server applications such as database management systems. Hence, most consumer-grade processors lack this functionality, which reduces the effectiveness of the TSX-based method. Additionally, critical bugs found in TSX have pushed the CPU vendor to disable the feature in recent processors [8].
IV. MELTDOWN ATTACK WITH RETURN STACK BUFFER
Despite its novelty, the original Meltdown attack [3] presents several limitations, as described in the previous section, i.e., (1) a high amount of system noise in the cache covert channel and (2) low usage of the TSX feature.
In this section, we propose a new variant of the Meltdown attack that overcomes these limitations. Our approach utilizes an RSB, which is a type of branch prediction unit equipped in most commodity processors, to create transient executions. The RSB allows a cache covert channel to be established without context switching, and hence achieves a high signal-to-noise ratio in the channel. In addition, its wide usage across various processors, contrary to TSX, improves the effectiveness of Meltdown-type attacks in practical systems.
A. RSB
In the x86 instruction set architecture, the calling mechanism is implemented with a pair of instructions: call and ret. These instructions rely on a stack; call saves its return address on the stack before jumping to the destination, and ret transfers control to the return address by retrieving it from the stack. As these instructions always perform memory accesses, the x86 calling mechanism suffers from long execution delays.
To minimize the delay, CPU vendors devised an RSB for branch prediction; they exploit the fact that ret is a branch instruction whose destination is highly predictable. An RSB is a hardware buffer comprising 16 to 32 entries, where each entry stores a return address of a call instruction. The number of entries varies by microarchitecture.
At the execution of a call instruction, a processor internally pushes a return address on top of the RSB. When the corresponding ret instruction is encountered, the processor pops the return address from the RSB while retrieving it from the stack. As many cycles are required to obtain the return address from the memory, the processor attempts to speculatively jump to the destination predicted from the RSB. Provided that the return address has not been manipulated in memory, the prediction is very likely to be correct, and the execution delay is minimized. If the prediction is not correct, however, the speculatively executed instructions must be rolled back.
Fig.2 illustrates the usage of an RSB in the x86 calling mechanism. Two functions, func1 and func2, are called in a nested manner by call instructions, whose return addresses are retaddr1 and retaddr2, respectively. Fig.2(a) shows the status of the stack and the RSB at the time of executing a ret instruction in func2. The processor loads retaddr2 from the stack while consulting the RSB for prediction. The processor yields a correct prediction as the entry in the RSB matches the return address on the stack. However, changing the return address on the stack to newaddr prior to the ret instruction can cause a mismatch between the stack and the RSB. This will cause a misprediction to retaddr2 and eventually transient executions at the mispredicted address.
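The matching and mismatching cases can be modeled with a small Python sketch. It is a toy model under stated assumptions: the 16-entry capacity is one value from the range given above, the overwrite-oldest policy on overflow is an assumption, and the names `ToyRSB`, `call`, and `ret` are hypothetical.

```python
class ToyRSB:
    """Fixed-size stack of predicted return addresses."""
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = []

    def push(self, addr):
        if len(self.entries) == self.capacity:
            self.entries.pop(0)           # assumed policy: drop the oldest entry
        self.entries.append(addr)

    def pop(self):
        return self.entries.pop() if self.entries else None

def call(stack, rsb, return_addr):
    stack.append(return_addr)             # architectural return address in memory
    rsb.push(return_addr)                 # microarchitectural prediction

def ret(stack, rsb):
    predicted = rsb.pop()                 # available immediately
    actual = stack.pop()                  # arrives later, from memory
    return predicted, actual, predicted != actual

stack, rsb = [], ToyRSB()
call(stack, rsb, "retaddr1")              # call func1
call(stack, rsb, "retaddr2")              # nested call func2
stack[-1] = "newaddr"                     # func2 overwrites its return address
print(ret(stack, rsb))                    # predicted retaddr2, actual newaddr: mispredicted
print(ret(stack, rsb))                    # predicted retaddr1, actual retaddr1: correct
```

In the mispredicted case, the instructions at the predicted address run transiently until the actual return address arrives from the stack.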
B. THE PROPOSED METHOD
We present a new variant of the Meltdown attack that outperforms the previous technique. In our method, a cache covert channel is established by exploiting speculative executions on the RSB, rather than by leveraging page fault handlers as in the previous method. Hence, our method relies on neither context switching nor processor-specific hardware features such as TSX. More specifically, we intentionally cause mispredictions on the RSB to create transient executions that read kernel bytes. After an execution rollback, the control is transferred to the correct return address, where a Flush+Reload code is located.
Fig.3 illustrates the overall code layout of our attack. A call instruction that branches to a function named func is located at the lowest address. The Meltdown code, where an attempt to access kernel memory will occur, resides at org_retaddr, an address immediately after the call instruction; in fact, org_retaddr is the return address of the function call. The body of the function func and the code for the cache covert channel (i.e., the Flush+Reload code) follow the Meltdown code in sequence.
Our attack proceeds as follows. First, it is initiated by executing a call instruction that branches to func (see (1) in Fig.3). As a result of the execution, the return address org_retaddr will be pushed to the stack and the RSB. Once the function func has been invoked, it manipulates the return address that is stored on the stack (see (2) in Fig.3). More specifically, the function replaces org_retaddr with new_retaddr, which refers to the location of the code that extracts a secret through the cache covert channel (i.e., the Flush+Reload code).
At the end of the function, the processor executes a ret instruction (see (3) in Fig.3). Simultaneously, the processor attempts to predict the return address by consulting the RSB for a possible destination. For each call instruction, the RSB caches the corresponding return address, which refers to the location immediately after the call instruction. In our case, it is org_retaddr, the address of the location where the Meltdown code resides.
Based on the prediction, the processor speculatively jumps to the address org_retaddr and executes a sequence of instructions, which allows unauthorized access to the kernel (see (3a) in Fig.3). The speculation will eventually fail as the return address has been previously modified by the function func. Hence, the processor cancels the execution and discards its result. However, the changes that the transient execution made to the microarchitectural state (i.e., the cache state) persist even after the result has been discarded.
At this point, the processor has obtained the correct return address (i.e., new_retaddr) from the stack. After the rollback, the processor jumps to the destination, where the Flush+Reload code is located (see (3b) in Fig.3). This code measures the cache state using the Flush+Reload technique. The measurement results are subsequently decoded to extract the kernel value at the specified address.
C. CHALLENGING PROBLEM AND OUR SOLUTION
Fig.4(a) shows a snippet of the code that manipulates the return address in the function func. The mov instruction stores the value of the %r9 register, which is the modified address new_retaddr, to the stack location referenced by the stack pointer %rsp. Subsequently, the ret instruction attempts to load the address from the stack.
The problem in the code above is that the transient execution window is not large enough to execute the Meltdown code before the ret is retired. This is because processors utilize ILP. That is, modern superscalar processors typically have multiple execution ports to enable parallel execution of instructions. Specifically, various types of instructions that perform arithmetic, load, or store operations can be served by different execution ports. This implies that the mov and ret instructions in the code above, which perform store and load operations, respectively, can be dispatched to different ports in parallel. Moreover, a store buffer equipped in the processor maximizes the parallelism in a pair of load and store operations. That is, the value of the store operation (i.e., %r9) will be forwarded directly from the store buffer to the following load operation instead of being read from memory (this is called store-to-load forwarding). Hence, the ret instruction can immediately determine the return address, leaving no opportunity for speculative execution.
For the proposed attack to execute successfully, an effective method to increase the window of speculative execution is required. Our solution is to create a long dependency chain on the value of the store operation to delay its execution. Specifically, we insert a number of arithmetic instructions prior to the mov instruction. These instructions perform a series of arithmetic computations on the store value (i.e., %r9). The computation does not change the value. However, it introduces a delay in the resolution of the value when executing the mov instruction. Consequently, dispatching to the store port is delayed until the resolution completes. Hence, we can disrupt the store-to-load forwarding to enable a speculative execution at the predicted destination of the ret instruction.
Fig.4(b) shows a modified code snippet for our solution. We inserted a sufficient number of add instructions, which add %rax to %r9. As the %rax register holds zero, the additions never change the store value. However, they delay the execution of the store operation; the delay length is determined by the number of add instructions. Note that we can replace add with any kind of arithmetic instruction such as imul. In fact, the use of imul instructions can make the dependency chain much longer since imul has a longer latency than add.
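The effect of the dependency chain can be approximated with a short Python sketch. It is a rough model under stated assumptions: the per-instruction latencies (1 cycle for add, 3 cycles for imul) are typical published figures used here only for illustration, the neutral operand of 1 for imul is our assumption (the text above only specifies zero for add), and the helper names are hypothetical.

```python
# Assumed result latencies in cycles; actual values depend on the microarchitecture.
ASSUMED_LATENCY = {"add": 1, "imul": 3}

def run_chain(value, op, chain_length):
    """Applies `chain_length` dependent operations that leave `value` unchanged."""
    neutral = 0 if op == "add" else 1      # x + 0 == x and x * 1 == x
    for _ in range(chain_length):
        value = value + neutral if op == "add" else value * neutral
    return value

def chain_delay(op, chain_length):
    """Each operation waits for the previous result, so the latencies add up."""
    return ASSUMED_LATENCY[op] * chain_length

new_retaddr = 0x401000                     # hypothetical address held in %r9
assert run_chain(new_retaddr, "add", 30) == new_retaddr
print(chain_delay("add", 30), chain_delay("imul", 30))
```

Under these assumptions, a chain of 30 imul instructions delays the store roughly three times longer than a chain of 30 add instructions, which is the intuition behind preferring imul when a larger window is needed.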
V. IMPLEMENTATION AND EVALUATION
In this section, we describe our implementation of the proposed attack in detail. Furthermore, we present the experimental results of the implementation and our evaluation.
A. IMPLEMENTATION
We implemented the proposed attack based on libkdump [9], an open-source library that reads kernel memory by mounting Meltdown attacks. This library was developed for assessing the security of practical computer systems against Meltdown-type vulnerabilities. We extended the library by writing an additional function libkdump_read_rsb() that implements our attack. Fig.5 shows the source code of the libkdump_read_rsb() function. Each line of the inline assembly in the source is described as follows:
• Line 7: A call instruction that initiates our attack. It branches to a function func located at lines 11-18.
• Lines 8-10: These instructions are supposed to be speculatively executed by the branch prediction of the RSB. The %rcx register has a kernel address where a byte will be read. The byte is decoded into an index of an array whose base address is contained in the %rbx register. Memory access is subsequently performed to the address of the index to load the corresponding cache line into the cache.
• Lines 11-18: These instructions form the body of the function func. In this function, the return address on the stack is replaced with the location labeled measurement (Line 19). To create a long data dependency chain, add instructions are repeatedly inserted 30 times (Lines 14-16). We may use other arithmetic instructions such as imul instead. The number of repetitions was determined experimentally to be sufficient for the transient execution to complete.
• Lines 19-31: A function do_measure is called through these instructions (Lines 20-26). This function measures the cache state generated during the transient execution via the Flush+Reload technique. If the measurement fails, it will be retried through the loop (Lines 28-31).
B. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed method based on microarchitectural benchmarking results. The benchmarking was conducted on a host with an Intel Xeon E5-2620v4 CPU running Ubuntu 14.04 LTS. As the latest version of the OS has already been patched against the Meltdown vulnerability, we used an old, unpatched version of Linux in the experiments. Note that the same vulnerable setting can be reproduced even on the patched OS by passing the 'nopti' Linux kernel boot parameter.
1) EVALUATION UNDER VARIOUS WORKLOADS
We evaluated the performance in terms of the success rate (or error rate) and throughput under various workloads. To simulate the workloads, we used stress-ng, a tool that stresses the system in various selectable ways.
In the experiment, we used cache and CPU-intensive workloads (i.e., stressors) in stress-ng; we executed those stressors simultaneously while trying to read kernel bytes through Meltdown. The cache stressor was executed in two configurations: one that runs on the same physical core as the Meltdown code and one that runs on a different core, which introduce noise in the L1 and L3 caches, respectively. The CPU stressor was only executed on the same physical core.
Table 1 shows the results of the experiments. With no workloads (i.e., no stressors running in the background), the proposed method (referred to as RSB in the table) can read kernel data with an error rate of 0.1%. It takes 3420 cycles on average to read a single kernel byte, which accordingly yields a throughput of 456 KB/s. For the signal-handler-based method (referred to as SH), the error rate is 8% and its average latency was measured to be 18,100 cycles (i.e., a throughput of 86 KB/s). This latency is approximately 5 times larger than those of the other methods. We conclude that the delay is mainly caused by the kernel-to-user context switching in the execution.
When stressors are enabled, error rates increase for all the methods, especially under L1 cache and CPU-intensive workloads. Among them, the signal-handler-based method shows the highest error rate, which implies that it is sensitive to system noise.
It is worth noting that the proposed method shows the best performance regarding the throughput, though its error rates are almost the same as those of the TSX-based method under various workloads. As shown in the table, the difference comes from the average latency in reading a byte; the proposed method has a lower latency than the TSX-based method.
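The reported latency and throughput figures can be cross-checked against each other with simple arithmetic. The sketch below assumes that 1 KB denotes 1000 bytes and that throughput is the inverse of the per-byte latency; the function name is hypothetical, and the resulting cycles-per-second value is only what the two reported numbers jointly imply, not a measured clock rate.

```python
def implied_cycles_per_second(cycles_per_byte, throughput_kb_s, bytes_per_kb=1000):
    """If throughput = rate / cycles_per_byte, then rate = cycles_per_byte * throughput."""
    return cycles_per_byte * throughput_kb_s * bytes_per_kb

rsb_rate = implied_cycles_per_second(3420, 456)    # proposed method (RSB)
sh_rate = implied_cycles_per_second(18100, 86)     # signal-handler-based method (SH)
latency_ratio = 18100 / 3420                       # SH latency relative to RSB

print(f"RSB implies {rsb_rate:.3e} cycles/s")
print(f"SH  implies {sh_rate:.3e} cycles/s")
print(f"SH latency is {latency_ratio:.2f}x the RSB latency")
```

Both pairs of figures imply nearly the same cycle rate, so the reported throughputs are mutually consistent with the reported latencies, and the latency ratio agrees with the "approximately 5 times" statement.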
2) THE LENGTH OF THE TRANSIENT WINDOW
The proposed method creates transient execution by constructing a dependency chain of arithmetic instructions. We performed an experiment to determine how large a transient window the dependency chain can create.
In the experiment, we implemented a Meltdown code with a number of dummy instructions inserted right after the call instruction (i.e., after line 7 in Fig.5). The dummy instructions just perform some arithmetic operations with no side effects. Then, we measured the success rate of reading kernel data for the Meltdown code while varying the number of dummy instructions.
Fig.6 shows the experimental result. The graph in the figure depicts how many instructions can be executed transiently according to the length of the dependency chain in calculating the return address. The dependency length, referred to as 'C' in the graph, is determined by the number of repetitions of the arithmetic instructions (i.e., lines 14-16 in Fig.5). For instance, the success rate drops off with more than 74 transient instructions if the length of the dependency chain is 10, which implies that the maximum length of the transient window for C = 10 is 74. We also observe that a longer dependency chain allows more transient instructions; we obtained a maximum length of 100 transient instructions for C = 50. However, it is infeasible to achieve more than a hundred transient instructions since the capacity of the reorder buffer is limited.
3) COMPARISON WITH OTHER EXCEPTION SUPPRESSION METHODS
There are several exception suppression techniques for Meltdown-type attacks that are similar to the proposed method. For instance, an RSB-based method is also used in the LazyFP attack [10]; unlike our method, it creates transient execution by flushing the return address on the stack from the cache. Causing misspeculation in branch instructions is another exception suppression method, used in recently discovered microarchitectural attacks such as RIDL [11] and Fallout [12]. Refer to Section VII for more details on other exception suppression methods.
We evaluated the performance of all these methods in terms of success rate and latency. In the first experiment, we measured the success rate of each method in reading kernel data at various locations. The rate was measured by counting successful reads from the L1 and L3 caches in 100,000 attempts for each method. Table 2 shows the experimental result. In the table, 'TSX' refers to the TSX-based method used in the original Meltdown attack. 'Branch instruction' refers to the method used in RIDL and Fallout, and 'RSB-clflush' refers to the method used in the LazyFP attack. For the proposed method (referred to as 'RSB-dep'), the success rates were measured in two configurations with different lengths of the dependency chain. As shown in the table, we found no significant differences in the success rate among the methods.
Regarding the latency of reading kernel data, however, there is a noticeable difference among the exception suppression methods. In our second experiment, we measured the latency of each method. The latency was averaged over 100,000 measurements, where in each measurement the execution was retried up to 1,000 times until a kernel byte was read successfully. For a more detailed evaluation, the experiment was conducted while varying the number of transient instructions in the Meltdown code.
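The retry-and-average procedure described above can be sketched as follows. This is only a minimal model: `attempt` is a hypothetical stand-in for one attack attempt that reports whether the byte was read and how many cycles it took, and accumulating the cycles of failed attempts into each measurement is our assumption rather than a detail stated in the text.

```python
def measure_latency(attempt, max_retries=1000):
    """Retry attempt() until it succeeds; return the accumulated cycles.

    attempt() is a hypothetical callable returning (success, cycles).
    Returns None if no attempt succeeds within max_retries.
    """
    total_cycles = 0
    for _ in range(max_retries):
        success, cycles = attempt()
        total_cycles += cycles
        if success:
            return total_cycles
    return None


def average_latency(attempt, measurements=100_000, max_retries=1000):
    """Average the latency over the measurements that eventually succeeded."""
    samples = []
    for _ in range(measurements):
        latency = measure_latency(attempt, max_retries)
        if latency is not None:
            samples.append(latency)
    return sum(samples) / len(samples) if samples else None
```

A measurement that never succeeds within the retry limit is excluded from the average here; other treatments (for example, counting it at the retry cap) are equally plausible.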
As shown in Fig. 7, the proposed method has the lowest latency in reading kernel data with up to 100 transient instructions. Because of the limited length of the dependency chain and the capacity of the reorder buffer, the proposed method performs poorly with more than 100 transient instructions (90 instructions for C = 30).
For the RSB-clflush method, more transient instructions are available because no long dependency chain of instructions is needed to create transient execution. However, its latency appears to be 100-200 cycles longer than that of the proposed method. We attribute the latency gap between RSB-clflush and RSB-dep to the different ways of resolving the return address. That is, in RSB-clflush, the return address is resolved by loading it from main memory, which usually takes longer than in RSB-dep, where it is resolved through calculation.
The other exception suppression methods, which use TSX and branch misspeculation, also have low latency. However, the transient window that they can create is considerably smaller than that of the RSB-based methods.
C. EXPERIMENTS IN PRACTICAL SCENARIO
To demonstrate the effectiveness of our method, we conducted experiments in practical attack scenarios. Specifically, we evaluated the performance of two kernel attacks: (1) breaking KASLR and (2) extracting a secret string from kernel memory.
The experiments were conducted on various hosts with different processors. Table 3 presents our experimental environments. Three different Intel processors were used, none of which were equipped with TSX.
In the experiments, we used the implementation described in Section V-A (i.e., the libkdump_read_rsb() function) for our attack. To ease the performance comparison, we used the libkdump_read_signal_handler() function in the libkdump library as an implementation of the signal-handler-based method of the original Meltdown attack.
1) BREAKING KASLR
KASLR is a security mechanism of modern operating systems that protects kernel memory. By randomizing the address space in a kernel, KASLR prohibits attackers from compromising the kernel. It has been recently discovered that KASLR is vulnerable to Meltdown-type attacks [3] . A Meltdown attacker can de-randomize the kernel address and extract a direct physical map offset in the kernel memory.
We conducted an experiment to evaluate the efficiency of the proposed attack in breaking KASLR. For the experiment, we used the PoC (Proof-of-Concept) implementation in the libkdump library [9] that demonstrates KASLR attacks. The PoC attempts to de-randomize KASLR by using Meltdown attacks. Specifically, it repeatedly reads data at candidate kernel addresses, increasing an offset each time, until it encounters an expected value. The randomized offset is then inferred from both the physical address and the virtual kernel address of that value.
The original PoC code was implemented based on the libkdump library, which supports only the signal-handler-based and TSX-based methods. We slightly modified the code so that our method (i.e., libkdump_read_rsb()) is also available in the experiment.
In the experiment, we measured the average execution time of obtaining the valid kernel offset by mounting attacks against the target systems. We performed 100 measurements on the same KASLR instance of each operating system and then calculated the average. To examine the performance under various system conditions, stress-ng was also used in the experiment. As the performance of Meltdown-type attacks is primarily influenced by cache and CPU activity, we executed stress-ng in the background with only two options enabled, which keep the cache and CPU busy. stress-ng was executed on the same physical core as the PoC code. Fig. 8 shows the experimental results. The graph presents the average execution time for successfully obtaining the direct physical map offset. The experimental results indicate that the proposed attack performed better than the signal-handler-based method of the original Meltdown attack under various conditions.
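The search loop described above can be sketched as follows. This is a simplified model, not the libkdump PoC itself: `read_value` is a hypothetical stand-in for a Meltdown read primitive, and the scan range and 1 GiB step used in the usage example are assumptions chosen for illustration.

```python
def find_direct_map_offset(read_value, phys_addr, expected, start, end, step):
    """Scan candidate offsets until the expected value is read.

    read_value(vaddr) is a hypothetical stand-in for a Meltdown read
    primitive. Each candidate offset is tested by reading the virtual
    address `offset + phys_addr` and comparing against `expected`.
    Returns the matching offset, or None if the range is exhausted.
    """
    for offset in range(start, end, step):
        if read_value(offset + phys_addr) == expected:
            return offset
    return None


# Usage with a simulated memory; every constant below is illustrative.
GIB = 1 << 30
scan_start = 0xFFFF880000000000
true_offset = scan_start + 5 * GIB
marker_phys = 0x1234000
simulated_memory = {true_offset + marker_phys: 0xDEADBEEF}

found = find_direct_map_offset(
    simulated_memory.get, marker_phys, 0xDEADBEEF,
    scan_start, scan_start + 64 * GIB, GIB,
)
```

In a real attack each read may fail or return noise, so a practical loop would repeat each read several times before moving to the next candidate.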
2) EXTRACTING A STRING IN KERNEL AREA
We evaluated the performance of the proposed attack in extracting a (secret) string from the kernel area. In this experiment, we assumed that the virtual address of the target string is known to the attacker. In fact, the attacker can easily determine the virtual address of the target string from the direct physical map offset learned by breaking KASLR. More concretely, the virtual address of the string can be calculated by adding the direct physical map offset to its physical address.
To simplify the experiment, we implemented a program that obtains the physical address of the target string. Executing in privileged mode, the program retrieves the physical address from pagemap. Given the direct physical map offset and the physical address, another program that implements the attack attempts to read the target string. We executed the program 100 times and measured the elapsed time to complete the extraction. We also used stress-ng to simulate various workloads by executing it on the same physical core. The experimental results are shown in Fig. 9. The graph shows the average elapsed time of a successful extraction of the 256-byte string from kernel memory. This result shows that our attack is more efficient than the signal-handler-based method in extracting strings from the kernel area.
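The address calculation described above can be sketched as follows. The sketch relies on the Linux pagemap entry layout (page frame number in bits 0-54, present flag in bit 63) and assumes 4 KiB pages; the function names are ours and do not correspond to the helper program used in the experiment.

```python
import struct

PAGE_SIZE = 4096              # assumed 4 KiB pages
PFN_MASK = (1 << 55) - 1      # pagemap bits 0-54: page frame number
PRESENT_BIT = 1 << 63         # pagemap bit 63: page present in RAM


def read_pagemap_entry(vaddr, pagemap_path="/proc/self/pagemap"):
    """Read the 64-bit pagemap entry of the page containing vaddr.

    Needs privileges on recent kernels; otherwise the PFN reads as zero.
    """
    with open(pagemap_path, "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)
        return struct.unpack("=Q", f.read(8))[0]


def phys_from_pagemap_entry(entry, vaddr):
    """Translate a pagemap entry into a physical address, or None if absent."""
    if not entry & PRESENT_BIT:
        return None
    pfn = entry & PFN_MASK
    return pfn * PAGE_SIZE + (vaddr % PAGE_SIZE)


def direct_map_vaddr(direct_map_offset, phys_addr):
    """Kernel virtual address of phys_addr inside the direct physical map."""
    return direct_map_offset + phys_addr
```

For example, a physical address of 0x12345abc combined with a direct physical map offset of 0xffff888000000000 (an illustrative value) yields the kernel virtual address 0xffff888012345abc.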
D. APPLICABILITY FOR OTHER ATTACKS
The RSB-based method proposed in this paper is not limited to the original Meltdown attack; it is also applicable to other types of transient execution attacks. In this section, we discuss the applicability of our method to several attack variants as well as to other CPUs.
1) FORESHADOW ATTACKS
Foreshadow is a variant of Meltdown-type attacks. While the original Meltdown attack relies on a page fault caused by a privilege violation, Foreshadow utilizes different page faults that immediately abort the address translation [13], [14]. This attack exploits the L1 Terminal Fault (L1TF) vulnerability in some Intel CPUs, which continue deriving a physical address after the fault and transiently load the data at that address if it resides in the L1 cache. In our implementation of the Foreshadow attack, we induced this page fault behavior by clearing the ''present'' bit in a page table entry. The target page is mapped into the same address space as the implementation. We measured the success rate and throughput of reading data in the same experimental environment as in Section V-B. In the experiment, the Foreshadow attack with our RSB-based method achieved a success rate of 99.9% and a throughput of 445 KB/s on average with a dependency chain of C = 30.
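The bit manipulation behind clearing the ''present'' bit can be illustrated as follows. On x86-64, the present flag is bit 0 of a page table entry; the helper names and the sample entry value are hypothetical, and modifying a real entry requires kernel privileges, which this sketch does not attempt.

```python
PTE_PRESENT = 1 << 0  # bit 0 of an x86-64 page table entry


def is_present(pte):
    """Return True if the present flag of the entry is set."""
    return bool(pte & PTE_PRESENT)


def clear_present(pte):
    """Return the entry with the present flag cleared; other bits unchanged."""
    return pte & ~PTE_PRESENT


# Illustrative entry value: the frame address bits survive the change.
sample_pte = 0x8000000012345067
cleared_pte = clear_present(sample_pte)
```

Because only the flag changes, the entry still carries the physical frame address, which is the address that a vulnerable CPU continues to use transiently.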
2) MICROARCHITECTURAL DATA SAMPLING (MDS) ATTACKS
MDS attacks are microarchitectural attacks that exploit MDS vulnerabilities found even in recently patched Intel CPUs. The vulnerabilities allow in-flight data located in internal CPU buffers (e.g., the line fill buffer and the store buffer) to be leaked to another process on the same physical core.
Following MDS attacks such as RIDL [11] and ZombieLoad [15], we implemented an attack that uses the line fill buffer as a leakage source. Our implementation is based on ZombieLoad, whose source code is available on the authors' web site. We modified the code so that our RSB-based method is used for suppressing the exception.
In the experiment, we set up a victim application that repeatedly loads secret data from memory in an infinite loop. The MDS attack was executed on the sibling logical core, trying to read the in-flight data of the victim. The experimental result shows that the attack leaks the secret with a success rate of 96.5% and a throughput of 467 KB/s.
3) APPLICABILITY FOR OTHER PROCESSORS
We also discuss the applicability of the proposed method to other processors. As our work extends Meltdown-type attacks with our RSB-based method, we focus on processors vulnerable to Meltdown attacks. To the best of our knowledge, no AMD processors are affected by the Meltdown-type vulnerabilities [16]. Regarding ARM processors, the vendor reported that a certain CPU model is affected by the vulnerability [17]. We did not conduct an experiment on ARM processors because that model was not available in our lab. Although we lack an explicit evaluation, we believe that our method is applicable to the Meltdown-vulnerable ARM processor, since an RSB is commonly used in modern processors.
VI. COUNTERMEASURE
In this section, we overview some possible solutions to prevent microarchitectural attacks including the proposed attack.
A. KERNEL PAGE-TABLE ISOLATION (KPTI)
One solution for mitigating our attack is to isolate the kernel page table from the user space. Gruss et al. [18] proposed a kernel protection technique, called KAISER, to protect against side-channel attacks that break kernel-level address space randomization (i.e., KASLR). This technique divides the address space of a program into a user address space and a kernel address space by switching the CR3 register. More specifically, it unmaps kernel pages while running in user mode and remaps them when a context switch to the kernel occurs. Consequently, a transient execution triggered from user-mode code is prevented from accessing kernel memory, thus preventing Meltdown attacks. In fact, KPTI has already been applied as a security patch to the latest versions of mainstream OSs such as Linux. However, it has been reported that kernel page-table isolation can introduce significant performance overhead. Moreover, this protection technique cannot prevent attacks within the security boundary of the same privilege mode, such as accessing process memory outside a sandbox.
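On Linux, whether this mitigation is active can be checked from user space. The following sketch reads the kernel's sysfs vulnerability report for Meltdown and maps it to a coarse status; the file path is the standard one on kernels that expose it, while the function names and the status labels are our own.

```python
from pathlib import Path

SYSFS_MELTDOWN = "/sys/devices/system/cpu/vulnerabilities/meltdown"


def classify_status(report):
    """Map the kernel's report string to a coarse status label."""
    text = report.strip()
    if text.startswith("Mitigation"):      # e.g. "Mitigation: PTI"
        return "mitigated"
    if text == "Not affected":
        return "not affected"
    if text.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def meltdown_status(path=SYSFS_MELTDOWN):
    """Read the sysfs report if it exists; otherwise report 'unknown'."""
    report_file = Path(path)
    if not report_file.exists():
        return "unknown"
    return classify_status(report_file.read_text())
```

A result of "mitigated" with the report "Mitigation: PTI" indicates that page-table isolation is enabled on the running kernel.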
B. ATTACK DETECTION
Transient execution attacks cause a substantial amount of cache contention because they internally utilize cache side-channel techniques to obtain the secret. Such contention can be exploited to build a runtime detection system for Meltdown-type attacks. Anomaly-based detection techniques for cache side-channel adversaries [19], [20] have been proposed. These techniques typically utilize hardware performance counters, a hardware component that provides real-time counts of CPU events such as cache hits and misses. By utilizing these counters, we can build intrusion detection systems not only for the proposed attack but also for any type of microarchitectural attack that results in cache contention.
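A very simple form of such anomaly detection can be sketched as follows. The sketch assumes that per-interval counts of cache references and cache misses have already been collected from the performance counters; the miss-ratio threshold and the required number of consecutive intervals are illustrative assumptions, not values taken from the cited detection techniques.

```python
def detect_cache_anomaly(samples, ratio_threshold=0.6, min_consecutive=3):
    """Return the index of the first interval of a suspicious streak, or None.

    samples is a sequence of (cache_references, cache_misses) pairs, one per
    sampling interval. A streak is `min_consecutive` successive intervals
    whose miss ratio exceeds `ratio_threshold` (both assumed parameters).
    """
    streak = 0
    for index, (references, misses) in enumerate(samples):
        ratio = misses / references if references else 0.0
        streak = streak + 1 if ratio > ratio_threshold else 0
        if streak >= min_consecutive:
            return index - min_consecutive + 1
    return None
```

Requiring several consecutive intervals reduces false alarms from short bursts of benign cache misses, at the cost of a slower reaction.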
C. HARDWARE-BASED MITIGATION
Processor vendors have also introduced hardware-based mitigations, released through microcode updates or direct hardware fixes. For mitigating Meltdown-type attacks, Intel released several microcode updates that prevent speculative cache loads by patching some instructions [21] and flushing the L1 cache [22]. For Spectre-type attacks, Intel also released various hardware fixes that prevent pollution of the branch prediction units. In particular, for attacks exploiting an RSB, Intel implemented RSB refilling on Skylake+ processors, a patch that intentionally fills the RSB with benign addresses on every switch into the kernel [23].
VII. RELATED STUDIES
In this section, we present the related studies.
A. TRANSIENT EXECUTION ATTACKS
Transient execution attacks exploit processor optimization techniques such as out-of-order and speculative execution [2]. By causing exceptions or mispredictions, the attacker intentionally creates a transient execution of certain instructions. During the execution window, an attempt to access unauthorized data is made across security boundaries. As those instructions are not committed, the processor restores the architectural state (e.g., the state of registers and memory) affected by the execution result. However, changes to the microarchitectural state (e.g., the cache) are not reverted by the rollback. The footprint left in the cache is subsequently decoded into secret information by utilizing cache covert channel techniques [4], [6].
According to the taxonomy proposed by Canella et al. [2], transient execution attacks are classified into Meltdown- and Spectre-type attacks. In Meltdown-type attacks, transient executions are initiated by triggering certain exceptions. In most studies, a page fault (#PF exception) is used for the attack; the fault is raised by causing a privilege violation [3] or read/write access violations [24], or by clearing the present bit in a page table entry [13], [14]. In some studies, other exceptions have been utilized, such as the general protection fault (#GP) [25] and the device-not-available (#NM) exception [10]. In this type of attack, a signal handler or a hardware feature such as TSX is typically used to extract secret information through the cache covert channel. As mentioned in Section III-B, however, those methods are restrictive in practical scenarios.
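The decoding step of a Flush+Reload covert channel can be sketched as follows. The sketch assumes that the reload latency of each of the 256 probe entries has already been measured; the hit threshold of 150 cycles is an illustrative assumption that in practice must be calibrated per machine.

```python
def decode_byte(reload_cycles, hit_threshold=150):
    """Decode one byte from per-entry reload latencies.

    reload_cycles[i] is the measured latency of reloading probe entry i.
    The entry touched during transient execution is cached and therefore
    reloads fastest. Returns its index, or None if no entry is below
    `hit_threshold` (an assumed, machine-specific value).
    """
    fastest = min(range(len(reload_cycles)), key=reload_cycles.__getitem__)
    return fastest if reload_cycles[fastest] < hit_threshold else None


# Simulated timings: every entry misses except the one indexed by the secret.
timings = [300] * 256
timings[0x41] = 60
secret = decode_byte(timings)
```

With the simulated timings above, the decoded value is 0x41; if all entries miss, the function reports that no byte was recovered so that the attempt can be retried.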
In Spectre-type attacks, a transient execution is created by causing mispredictions rather than exceptions. Various sources of prediction units are used in this type of attack, such as the BTB [26] , pattern history buffer [24] , [26] , [27] , and memory order buffer [28] . Some Spectre variants [29] , [30] utilize an RSB as a source of misprediction. Although they appear similar to our work, they give no details on how to exploit the RSB for exception-based transient executions.
In contrast, our study focuses primarily on demonstrating the benefits of using the RSB in mounting Meltdown-type attacks compared with the signal-handler-based and TSX-based methods.
B. EXCEPTION SUPPRESSION TECHNIQUES
In Meltdown-type attacks, suppressing exceptions is one of the key challenges in constructing a more reliable cache covert channel. Lipp et al. [3] proposed a suppression method using TSX. As discussed in Section III-B, the TSX-based method is limited by its dependency on specific hardware features.
Stecklina and Prescher [10] proposed an RSB-based exception suppression method for mounting the LazyFP attack. The same approach was also proposed by Koruyeh et al. [30]. Unlike our method, they flush the return address from the cache to create sufficient delay in resolving the correct return address. Compared to our RSB-based method, flushing the cache incurs a longer execution delay in the Meltdown attack. Wong [31] also proposed an idea, called Specpoline, to suppress an exception in Meltdown attacks by using an RSB. However, that work presents no implementation details on how to delay the resolution of the return address.
Another method of suppressing exceptions is to exploit branch misspeculation, which is used in recently discovered microarchitectural attacks such as RIDL [11] and Fallout [12]. As discussed in this paper, it provides only a limited window of transient execution.
C. CACHE SIDE-CHANNEL ATTACKS
Cache side-channel attacks allow an attacker to learn the secret information of a victim by tracing the cache behavior created by the victim [32]. Cache side-channel attacks on various target applications have been proposed in several studies. Most studies presented attacks against cryptographic implementations to effectively demonstrate their security impact. Yarom et al. demonstrated the leakage of private keys from RSA [4] and ECDSA [33] implementations via the Flush+Reload technique. Shin et al. [34] and Genkin et al. [35] demonstrated that an ECDH implementation is vulnerable to cache-based attacks. Irazoqui et al. [36] and Gulmezoglu et al. [37] presented attacks that extract AES keys using the Flush+Reload technique. The Prime+Probe technique has been utilized in several studies to extract secret keys from various cryptographic algorithms, such as ElGamal [38], RSA [7], [39], and AES [40], [41].
In addition to cryptographic algorithms, Lipp et al. [42] and Gruss et al. [6] demonstrated attacks on practical software, such as learning information about user inputs from a touchscreen and a keyboard, respectively. Zhang et al. [43] reported a method to obtain browsing history from a web browser by exploiting the cache side channel. Weber et al. [44] presented a method to implement SSH over a cache-based covert channel.
VIII. CONCLUSION
The Meltdown attack exploits the out-of-order execution of modern processors to enable unauthorized kernel access. The original Meltdown attack suffers from two problems: a large amount of system noise and the limited availability of processor-specific hardware features, such as TSX, both of which markedly restrict the effectiveness of the attack. We herein proposed a more effective method for the Meltdown attack using an RSB. The RSB eliminates the need for context switching during the attack process, thus enabling a low-noise cache covert channel without relying on TSX. The wide usage of RSBs in commodity processors further improves the effectiveness of Meltdown-type attacks. We presented an implementation of our attack in detail and evaluated its performance experimentally. Furthermore, we presented some countermeasures to mitigate the attack.
