We investigate how different categories of microarchitectural state on recent ARM and x86 processors can be used for covert timing channels and how effective architecture-provided mechanisms are in closing them. We find that in recent Intel processors there is no effective way of sanitising the state of the branch prediction unit and that, contrary to an often-held belief, flushing the translation lookaside buffer on Intel processors does nothing to mitigate attacks based on this component. We further show that in both ARM and x86 architectures flushing all the hardware caches does not effectively close cache-based timing channels. The implication is that secure sharing of a processor core in these architectures is not possible, irrespective of cost.
Introduction
Computer hardware is increasingly being shared between multiple, potentially untrusted, programs. Examples of such sharing range from cloud services where a single computer may share workloads of multiple clients via mobile phones that run multiple apps, each authored by a different developer, to web browsers displaying pages from different sites. To protect confidential or private information that some of these programs may access, the system imposes a security policy that prevents the dissemination of such information.
One threat to the security of the system is posed by covert channels [30], which allow colluding programs to bypass the security policy by transferring information over media that are not controlled by the system. A typical scenario includes two programs: a Trojan program, which has access to sensitive information but is confined [33] by the security policy (i.e. prevented from sending information to arbitrary destinations), and a spy process that does not have access to the sensitive information but can communicate with fewer restrictions. Using a covert channel, the Trojan can send the sensitive information to the spy, which can then exfiltrate it from the system.
Covert channels are often classified as either storage or timing channels [46] . Storage channels exploit the ability of one program to store data that the other program can read. Timing channels, in contrast, exploit timing variations for transferring information. Past research has demonstrated the possibility of completely eliminating storage channels [39] .
For timing channels, the picture is not that clear. Some classes of timing channels can be eliminated by ensuring deterministic timing of any externally visible effects of programs [10] , [12] , [14] , and mitigation strategies are often suggested for published microarchitectural channels [16] , [41] , [53] . However, there is currently no known method that guarantees the absence of timing channels on shared hardware.
In this paper we examine the degree to which it is possible to prevent timing channels on modern hardware. Specifically we look at intra-core channels, which exploit hardware state for signalling between processes or VMs that time-share a processor core. This means we not only ignore channels between cores, but also between concurrent executions on a single core (hyperthreading); channels between hyperthreads are well-documented and understood [9] and are probably impossible to close. This is thus a fairly restricted scenario and one would expect that all channels could be trivially (albeit expensively) closed by flushing all cached state on a context switch, using the appropriate hardware mechanisms, so the main challenge would seem to be how to minimise the cost of the defence.
However, we show reality to be different: we demonstrate that on recent ARM as well as x86 processors there are channels that resist all attempts to close them by flushing the state they exploit.
Specifically, we implement several covert-channel techniques, including the Prime+Probe attack on the L1 data cache [40] , [42] and the L1 instruction cache [2] , [3] . We also implement new attacks targeting the translation lookaside buffer (TLB), the branch predictor unit (BPU), and the branch target buffer (BTB). We measure the channels created by these techniques, first without mitigations, to demonstrate the existence of the channel, and then with the use of mitigation techniques, to measure the remaining channel.
Our results show that some channels remain even after activating all of the available mitigation techniques. In particular, we note that the x86 does not support any instruction or documented method for clearing the state of the BPU. Consequently, the branch prediction channel remains open.
We further note that, popular belief notwithstanding, invalidating the contents of the caches does not close all cache-based channels and that, at least on Intel x86, flushing the TLB has a negligible effect on the TLB channel.
In summary, we make the following contributions:
• We identify a limited scenario for investigating microarchitectural-timing-channel elimination (section 3).
• We implement multiple persistent-state microarchitectural covert channels, some of which have previously only been speculated, but never implemented (subsection 4.1), identify existing mitigation techniques available in existing processors (subsection 4.2), and measure the channels with and without those mitigation techniques (section 5). Our results show that on present hardware, intra-core channels remain even when using all hardware-supported flush operations.
Background
We begin by describing the relevant components of modern processors, and how they can be leveraged for timing channels.
The Microarchitecture
The instruction set architecture (ISA) is the hardware-software contract for a processor family, such as x86 or ARM Cortex-A. The ISA specifies the functional operation of the processor, including the instructions that the processor can execute, their encoding, the available registers and their functions.
While there may be some minor variations in feature support between processors in a family, the core of the ISA remains invariant, allowing seamless software support across the family.
The ISA abstracts over a processor's implementation, which is made up of a large number of components, including functional units, caches, buses and interfaces, collectively called the microarchitecture. While functionally transparent, details of the microarchitecture affect the timing of operations. Much of this is the result of the processor caching information in many places, in order to improve average-case execution speed. Any such caches make the latency of operations dependent on execution history and thus create the potential for timing channels. We now describe the relevant components.
CPU caches These are, in terms of their effect on timing, the most noticeable components. The caches bridge the speed gap between the processor and the much slower memory, by holding recently-accessed data or instructions.
A cache is a bank of high-speed memory, typically using static random access memory (SRAM) technology, which is faster albeit more expensive than the dynamic random access memory (DRAM) technology commonly used in the main memory. Faster technology and greater proximity to the processing core (enabled by smaller size) means that access to the cache is much faster than to the main memory. Caches utilise the spatial and temporal locality of programs for reducing average access time.
Cache organisation The cache is organised in lines of a fixed, power-of-two size, typically ranging from 32 to 128 bytes. The line is the unit of allocation and transfer to memory, i.e. at any time a line either is invalid or caches a size-aligned block of memory. The lowest-order bits of the address of a data item or instruction are the line offset, i.e. they determine where the item is located within the line.
Caches are generally set associative, meaning that a fixed number, k, of lines are grouped into a set, where k is the associativity of the cache; the lines of a set are also often called ways. Cache content is located by hashing the address onto a set number. In most cases the hash is just the low-order address bits after stripping the offset bits. Within the set, the correct line is found by associative lookup, comparing the address bits with a tag stored in each line. If none of the tags match, the item is not in the cache (i.e. a cache miss).
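As a minimal sketch of the lookup described above, the following Python snippet splits an address into tag, set index and line offset, assuming the simple hash that uses the low-order address bits. The geometry (64-byte lines, 64 sets) and the function name `split_address` are illustrative assumptions and do not describe any of the processors studied here.

```python
def split_address(addr: int, line_size: int = 64, num_sets: int = 64):
    """Split an address into (tag, set_index, line_offset).

    Assumes the simple hash described in the text: the set index is the
    low-order address bits that remain after stripping the offset bits.
    line_size and num_sets are hypothetical and must be powers of two.
    """
    assert line_size & (line_size - 1) == 0, "line size must be a power of two"
    assert num_sets & (num_sets - 1) == 0, "number of sets must be a power of two"
    offset_bits = line_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    line_offset = addr & (line_size - 1)
    set_index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, line_offset


# Two addresses exactly line_size * num_sets bytes apart map to the same set
# but carry different tags, so they compete for the ways of that set.
a = 0x12345
b = a + 64 * 64
```

Addresses such as `a` and `b` are what an attacker needs in order to fill a chosen set: k of them with distinct tags occupy all k ways.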
Cache hierarchy As the speed gap between processor and memory is orders of magnitude, modern processors have a hierarchy of caches. Closest to the core is the L1 cache, generally split into separate instruction and data caches, I- and D-cache, and always private to a core. Further levels are unified, larger and slower, down to the last-level cache (LLC), which is generally shared between cores.
Cache addressing L1 caches are frequently virtually addressed, i.e. lookup is by virtual address of the item. All other caches are physically addressed. On recent Intel processors, the LLC lookup uses a more complex hash rather than just the low-order address bits. The hash function is unspecified, but has been reverse-engineered [27], [36], [57].
Translation lookaside buffer The mapping from virtual to physical addresses is specified in the page table.
To avoid frequent lookups, the processor caches translations in the TLB. It is usually organised as a two-level cache. Intel processors generally feature set-associative TLBs, while on many other architectures they are fully associative (single set) or a mixture (e.g. a fully-associative first-level and a set-associative second-level TLB on ARM Cortex A9). Intel processors also feature a separate cache for page directory entries.
Branch prediction To avoid pipeline stalls while processing branch instructions, processors feature a BPU, which predicts the target of branches. This allows the processor to speculatively fetch instructions following the branch. In the case of a misprediction, the speculative execution is rolled back and processing continues on the right branch. A typical BPU consists of at least two subunits: the BTB and the history buffer.
History buffer The history buffer aims to predict the outcome of conditional branches, i.e. whether the branch is taken or not. Prediction is typically based on the history of the specific branch, possibly with a combination of the outcomes of branches leading to it. The history buffer maintains a state machine for each branch (or a combination of a branch and branching history). In the common two-bit predictor, the predictor needs to mispredict twice in order for the prediction to change.
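The two-bit predictor mentioned above can be sketched as a saturating counter. This is a textbook model, not the (undocumented) design of any particular processor; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state: int = 0):
        self.state = state  # 0 = strongly not taken ... 3 = strongly taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def run(predictor, outcomes):
    """Return the prediction made before each outcome is learned."""
    predictions = []
    for taken in outcomes:
        predictions.append(predictor.predict())
        predictor.update(taken)
    return predictions


# Starting from "strongly taken", two not-taken outcomes (two mispredictions)
# are needed before the prediction flips to not taken.
flips = run(TwoBitPredictor(state=3), [False, False, False])
```

Note that the "mispredict twice" behaviour holds when starting from a saturated state; from a weak state a single misprediction is enough to change the prediction.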
Branch target buffer The BTB caches destination addresses of unconditional and taken conditional branches. Details are generally not specified by the manufacturer, but can frequently be reverse-engineered [20], [38], [50].
Prefetching Modern processors increase the effectiveness of the cache by predicting which memory locations will be accessed in the near future, and pre-loading these locations into the cache, a process called prefetching. This works best for constant-stride accesses, but modern prefetchers can deal with more complex access patterns. The exact operation of the prefetcher is generally unspecified.
Microarchitectural Channels
Microarchitectural state The microarchitectural components described above have in common that they maintain some state which is based on prior computation, either by caching raw data or instructions, or by implementing state machines that capture recent execution history. This state is functionally transparent, i.e. it does not affect the results or outcomes of the programs. However, because this state is used to improve the performance of the program, it affects operation timing and is therefore visible through variations in the timing of program executions.
Timing channels Whenever this state is shared between different program executions there is a potential timing channel, as the timing of one program may depend on the execution history of another. In general, a channel will exist unless the state is either strictly partitioned between programs (e.g. if the hardware tags the state with a program ID) or flushed when switching the processor between programs (context switch).
Covert channels If those conditions are not met, then a Trojan can, through its own execution, force the hardware into a particular state and a spy can probe this state by observing progress of its own execution against real time. This will constitute a covert channel, i.e. an information flow bypassing the system's security policy. For example, the Trojan can modulate its cache footprint, encoding data into the number of cache lines accessed. The spy can read out the information by observing the time taken to access each cache line. Or the Trojan can force the branch predictor state machine into a particular state, which the spy can sense by observing the latency of branch instructions (and thus whether they are predicted correctly).
The actual implementations of covert channels depend on the details of the particular microarchitectural feature they exploit. A large number of such implementations have been described, as surveyed by Ge et al. [19] .
Historically, covert channels were mostly discussed within the scope of multilevel security (MLS) systems [31]. Such systems have users with different classification levels and the system is required to ensure that a user with a high security clearance, e.g. a Top Secret classification, does not leak information to users with a lower clearance. The advent of modern software deployment paradigms, including cloud computing and app stores, increased the risk of covert channels and shifted some of the focus from military-grade systems to commercial and personal environments. Covert channels break the isolation guarantees of cloud environments where workloads of several users are deployed on the same hardware. Similarly, mobile devices rely on the system's security policy to ensure privacy whilst executing software of multiple, possibly untrustworthy developers.
Side channels The threat of microarchitectural channels is not restricted to environments compromised by Trojans. A side channel is a special case of a covert channel, which does not depend on a colluding Trojan, but instead allows a spy program to recover sensitive information from a non-colluding victim. Where they exist, side channels pose a serious threat to privacy and can be used to break encryption [19] .
In general, collusion allows better utilisation of the underlying hardware mechanism and hence covert channels tend to have much higher bandwidth than side channels based on the same microarchitectural feature; the capacity of the covert channel is the upper bound of the corresponding side channel capacity. This means that closing covert channels implicitly eliminates side channels. For that reason we focus on covert channels in this work, as we aim to establish to which degree microarchitectural timing channels can be eliminated. We note that any timing channel that allows the spy to obtain address information from the Trojan (i.e. which data or instructions it accesses) can potentially establish a side channel.
Prime+Probe Prime+Probe [40], [42] is a specific and commonly used technique for exploiting set-associative caching elements as cache-based timing channels. It has been applied to the L1 D-cache [40], [42], L1 I-cache [7], and the LLC [34].
Using the technique, the spy primes the cache by filling some of the cache sets with its own data. The Trojan uses each of the sets that the spy primes to transmit one bit. For clear bits, the Trojan leaves the spy's data in the set. For set bits, the Trojan replaces the spy's data with its own. The spy then probes the cache state to receive the information: for each primed set, it measures the time it takes to access the data it originally stored there. A long access time indicates that some data in the cache set was replaced between the prime and the probe stages, and therefore that the corresponding bit is set. A short access time indicates that the data is still cached, and thus that the corresponding bit is clear.
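The prime, transmit and probe steps can be illustrated with a toy simulation. This is only a sketch: the cache model, its LRU replacement policy, the geometry and all names (`ToyCache`, `transmit`) are assumptions made for illustration, and a cache miss stands in for a "long access time" rather than being measured with a real clock.

```python
from collections import OrderedDict


class ToyCache:
    """Set-associative cache with LRU replacement (an assumed policy)."""

    def __init__(self, num_sets: int = 8, ways: int = 4):
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, owner: str, set_index: int, way: int) -> bool:
        """Access one line; return True on a hit, False on a miss."""
        lines = self.sets[set_index]
        key = (owner, way)
        hit = key in lines
        if hit:
            lines.move_to_end(key)         # mark as most recently used
        else:
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[key] = True
        return hit

    def fill_set(self, owner: str, set_index: int):
        """Touch one line per way of a set; return the hit/miss results."""
        return [self.access(owner, set_index, w) for w in range(self.ways)]


def transmit(bits):
    """Send one bit per cache set from the Trojan to the spy."""
    cache = ToyCache(num_sets=len(bits))
    for s in range(len(bits)):             # prime: spy fills every set
        cache.fill_set("spy", s)
    for s, bit in enumerate(bits):         # Trojan evicts the spy's data for set bits
        if bit:
            cache.fill_set("trojan", s)
    received = []
    for s in range(len(bits)):             # probe: any miss means "slow", i.e. bit set
        hits = cache.fill_set("spy", s)
        received.append(0 if all(hits) else 1)
    return received
```

For example, `transmit([1, 0, 1, 1, 0])` recovers the same bit vector in this noise-free model. On real hardware the spy observes latencies rather than hit flags, so the received values are noisy.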
Defences
A microarchitectural timing channel can be eliminated if the underlying hardware state is either strictly partitioned or flushed. Partitioning is possible, e.g. in physically-addressed caches, using memory colouring [32] , [49] . This utilises the fact that in an associative cache, any particular memory location can only be cached in a specific set. The OS can allocate physical memory to security domains such that they cannot compete for the same cache sets.
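The idea behind memory colouring can be sketched as follows, assuming a physically-indexed cache whose set index is taken from the low-order address bits (this does not hold for caches that use a more complex hash). The cache parameters and function names are hypothetical.

```python
def num_colours(cache_size: int, ways: int, page_size: int = 4096) -> int:
    """Number of page colours in a physically-indexed set-associative cache."""
    way_size = cache_size // ways          # bytes covered by set index + line offset
    return max(1, way_size // page_size)   # 1 means the cache cannot be coloured


def page_colour(phys_addr: int, cache_size: int, ways: int, page_size: int = 4096) -> int:
    """Colour of the physical page that contains phys_addr."""
    return (phys_addr // page_size) % num_colours(cache_size, ways, page_size)


# Hypothetical cache: 256 KiB, 8 ways, 4 KiB pages -> 32 KiB per way -> 8 colours.
CACHE_SIZE, WAYS = 256 * 1024, 8
colours = num_colours(CACHE_SIZE, WAYS)

# Frames 3 and 11 share a colour and can therefore compete for cache sets;
# frame 4 has a different colour and maps to a disjoint group of sets.
same = page_colour(0x3000, CACHE_SIZE, WAYS) == page_colour(0xB000, CACHE_SIZE, WAYS)
```

An OS that hands out frames of disjoint colours to different security domains ensures, under these assumptions, that the domains never share a set of this cache.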
Where partitioning is not possible, e.g. in virtually-addressed caches, such as the TLB, or where it is not possible to associate state with domains, as may be the case in the state machines used in the branch predictor or prefetcher, state must be flushed on a context switch. Architectures generally provide instructions for flushing caches, but not for all of the other state.
A frequently-suggested defence is injecting noise, e.g. via random perturbations of the state [54] , [59] . While the approach reduces the usable capacity of channels, it cannot eliminate the signal (unless the noise anti-correlates with the transmitted signal) and the cost of reducing the signal-noise ratio quickly becomes prohibitive [14] . Furthermore, there are sophisticated channel implementations that are robust against noise [35] and covert channels have been demonstrated to work in noisy environments [37] .
Another countermeasure frequently suggested is to fuzz the clock [25], [51] or to reduce its resolution [1], [13]. We note that fuzzing the clock just introduces noise into the system, and thus has the same limitations as other ways of adding noise. Covert channels have been implemented even in the absence of high-resolution clocks [14]. Hence, these methods cannot completely eliminate covert channels and are therefore not suitable for our purposes.
Our aim is not to minimise the cost of channel mitigation, but rather to establish whether the channels can be closed at all. We therefore use the (costly) brute-force approach of flushing any state that the hardware allows us to flush on each context switch.
seL4
Covert channels exploit shared hardware features to provide information flow between programs running in different security domains. As such they are independent of the operating system (OS) or hypervisor separating the domains. The only role the OS or hypervisor plays is in applying mitigations, trying to close the channels. Therefore, the actual OS or hypervisor used on the platform is of little importance, other than that it must allow implementation of the mitigations.
For our experiments we use the seL4 microkernel [47], for a number of reasons. First, seL4 is a small (about 10,000 lines of code) and simple system, which makes it relatively easy to implement mitigations, compared to a large, complex system such as Linux. Furthermore, seL4 is specifically designed for use in security-critical systems, and has undergone comprehensive formal verification, with proofs of implementation correctness and security enforcement [29]. In particular, seL4 has been proved free of storage channels [39], which means any remaining channels must be timing channels, simplifying analysis of results.
seL4 can be used as the basis of a general-purpose OS, as a separation kernel [45] or as a hypervisor. In this work we use it as a separation kernel. This means that our setup contains the minimum amount of software, with our attack code running in a minimal environment directly on top of seL4, with no actual OS services. This avoids any interference from other software components. Note that using seL4 as a hypervisor would expose the same microarchitectural channels, but possibly more. Demonstrating channels in the separation-kernel setup implies more generality, as the same channels will exist in the virtualisation setup.
Threat Model
In our threat model we assume that the adversary manages to execute a Trojan program within a secure environment. For example, the adversary may compromise a service within the secure environment and inject the Trojan's code. Alternatively, the adversary may be, or may control, a corrupt developer that inserts malicious code into a software product used in the secure environment, e.g. an app used to process private data on a smartphone. Executing within the secure environment gives the Trojan access to sensitive data which the adversary wants; however, the security policies of the secure environment prevent data exfiltration by the Trojan.
Additionally, the adversary controls a spy program which executes on the same computer as the Trojan, for example, in a different virtual machine. The spy is not executing within the same secure environment and consequently can communicate freely with the adversary, but it does not have access to the sensitive data. The adversary's aim is to exploit a microarchitectural covert channel in the shared hardware. If such a channel exists and is not controlled by the system, the Trojan can use the channel to send the sensitive data to the spy, which can then send the data to the adversary. In this work we check whether the system can prevent the adversary from exploiting such covert channels.
As indicated in subsection 4.1, we focus in this work on channels that can be exploited by Trojan and spy time-sharing a processor core. This allows us to ignore transient-state channels [9] , [58] , i.e. those that exploit the limited bandwidth of processor components. Transient-state channels rely on concurrent execution of the Trojan and the spy and are therefore automatically excluded in the time-sharing scenario.
We thus only need to handle persistent-state channels, which rely on exhausting the storage capacity of processor elements. In a typical persistent-state channel, the spy sets the targeted component to a known state. The Trojan executes, modifying the state based on the data to transmit. The spy then uses operations whose timing depends on the state of the component to measure the modifications the Trojan made and recover the data. Because persistent-state channels require storage in the targeted element, they often target caching elements. Examples of targeted elements include the data caches [34], [40], [42], instruction caches [2], [3], TLB [26], and BPU [5], [6], [18].
As we are exploring microarchitectural channels, we ignore timing channels that are controlled by software. For example, the Trojan could create a timing channel by varying its execution time before yielding the processor. We note that the system can protect against such channels by padding the execution time of the Trojan following a yield [10] , [12] , [14] . Moreover, because we investigate the processor's ability to close the channel, we only investigate channels within the processor itself. External channels, such as the DRAM open rows channel [44] , are outside the scope of this work.
Methodology
In this work we examine the level of support that manufacturers provide for eliminating microarchitectural timing channels in their processors. To this end, we implement multiple covert channels, identify the processor instructions and available information that can be used for mitigating the channels, and measure the capacity of the channel with and without the mitigation techniques. These steps are described in greater detail below.
Channels
Following Cock et al. [14] , we view a channel as a pipe into which a sender (the Trojan) places inputs drawn from some set I and which a receiver (the spy) observes as outputs from a set O. Both the inputs and the outputs depend on the specific covert channel used.
We implement four channels, each designed to target a specific microarchitectural component. But note that as these components are not isolated from the rest of the processor, the channels are affected by components other than those targeted.
We target the following channels:
L1 data cache This channel uses the Prime+Probe attack technique described in subsection 2.2 on the L1 D-cache [40], [42]. The input symbols consist of numbers between 0 and the number of sets in the cache. To send a symbol s, the Trojan reads data to fill cache sets 0, 1, ..., s, completely filling these sets. The spy performs the Prime+Probe attack by first filling the whole cache with its own data and then measuring the time to read the data from each cache set. The output symbol is the sum of the read measurements. For the implementation we adapt the L1 Prime+Probe attack of the Mastik toolkit [56] to the processors we use.
Note that we could use a more sophisticated encoding of the input symbols to increase capacity. However, the point is not to establish the maximum channel capacity, but to investigate whether it can be closed. We therefore keep things as simple as possible.
L1 instruction cache Here we use the Prime+Probe attack on the L1 I-cache [2], [3]. The approach is identical to the L1 D-cache channel, except that instead of reading data, the programs execute code in memory locations that map to specific cache sets. The implementation also uses an adaptation of the Mastik code.
Translation lookaside buffer To implement a TLB-based channel, our Trojan sends an input symbol consisting of a number between 0 and 128 (the size of the ARM TLB and twice the size of the x86 TLB). To send a symbol, s, the Trojan reads a single integer from each of s pages. The spy measures the time to access a number of pages. In order to reduce self-contention in the spy, it only accesses half of the TLB (64 or 32 pages, respectively).
A more sophisticated design would take into account the structure of the TLB and aim to target the individual associative sets. As before, we opted for simplicity rather than capacity.
The only prior implementation of a TLB-based channel is [26], which uses an intra-process TLB side channel to bypass the protection of kernel address space layout randomisation (KASLR). We are not aware of any prior implementation of inter-process TLB channels, and past work considers such channels infeasible because the TLB is flushed on context switch [60].
Branch prediction The branch prediction channel exploits the processor's history buffer. In each time slice, the Trojan sends a single-bit input symbol. Both the Trojan and the spy use the same channel code for sending and receiving. The code, shown in Figure 1, consists of a sequence of conditional forward branches that are always taken (Line 8). These set the history to a known state. The next code segment (Lines 10-17) measures the time it takes to perform a branch (Line 13) that conditionally skips over 256 nop instructions (Line 14). The branch outcome depends on the least significant bit of register %edi. The return value of the code (register %eax) is the measured time.
[Figure 1 (code listing). Only lines 1-5 of the listing are recoverable here:]
1 #define X_4(a) a; a; a; a
2 #define X_16(a) X_4(X_4(a))
3 #define X_256(a) X_16(X_16(a))
4
5 #define JMP jnc 1f; .align 16; 1:
Because the code takes different paths depending on whether the input bit is set, there would be a timing difference between these two paths. Another source of timing difference is the processor's prediction on whether the branch in Line 13 is taken. The channel exploits this timing difference.
For the channel implementation, the Trojan and the spy map the code at the same virtual address. Note that there is no need for the Trojan and the spy to use the same copy of the code (i.e. we do not rely on shared memory). To send an input symbol, the Trojan sets the least significant bit of the input register to the input symbol and repeatedly calls the code throughout the time slice. This sets a strong prediction of the branch's outcome. The spy, in its time slice, calls the code with the input bit cleared (branch taken) and uses the measured time as the output symbol.
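The following toy model illustrates why the Trojan's repeated calls are visible to the spy. It assumes, purely for illustration, that the branch at the shared virtual address is tracked by a single two-bit saturating counter that survives the context switch; real history buffers are considerably more complex and undocumented. A misprediction stands in for the longer measured time, and the function name is hypothetical.

```python
def spy_sees_misprediction(trojan_bit: int, trojan_calls: int = 100) -> bool:
    """Model one Trojan time slice followed by the spy's first measurement.

    Assumption: both processes' copies of the branch index the same two-bit
    saturating counter (0-1 predict not taken, 2-3 predict taken).
    """
    counter = 3                           # state left behind by an earlier spy slice
    trojan_taken = (trojan_bit == 0)      # input bit cleared means the branch is taken
    for _ in range(trojan_calls):         # Trojan trains the predictor all slice long
        if trojan_taken:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
    predicted_taken = counter >= 2
    spy_taken = True                      # the spy always calls with the bit cleared
    return predicted_taken != spy_taken   # misprediction: slower measurement


# In this noise-free model the spy recovers each bit from one measurement.
received = [int(spy_sees_misprediction(b)) for b in [1, 0, 0, 1]]
```

In the actual channel the spy observes a time, not a misprediction flag, so the output symbol is a noisy latency value rather than a clean bit.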
The channel is similar to, and is inspired by, the BTB channel of [18] . To adapt the channel to target the history buffer, we added a sequence of branches to set the branch history to a known state and replaced the original unconditional branch with a conditional branch. Prior channels that target prediction of conditional branches rely on the interaction between history buffer and the BTB [4] , [8] . To the best of our knowledge, we are the first to directly target the history buffer.
We have so far implemented this channel only on the Intel architecture.
Branch target buffer To build a BTB channel on the ARM Cortex A9 platform, we chain jmp instructions into a probing buffer, which is the same size as the branch-target address cache. The sender probes on 0 to 512 instructions, while the receiver times probing the entire buffer. To our knowledge, this is the first demonstration of using the BTB as a covert timing channel.
Mitigations
The microarchitectural channels we demonstrate in this work exploit timing variations due to the internal state of processor components. One way of closing these channels is for the OS to force the hardware into a defined state at context-switch time, to make it independent of prior computation.
Processors typically provide instructions to reset some of the internal state. For example, the Intel x86 WBINVD instruction invalidates all of the entries in the processor caches [28]. Consequently, after the instruction executes, accesses to data and instructions are served from memory even if they were previously cached. Hence, executing the instruction erases internal state in the cache. Similarly, reloading the CR3 register when executing in 32-bit mode of the x86 processor invalidates all of the entries in the TLB and paging-structure caches except global mappings. For flushing the global mappings, we reload the CR0 register.
An alternative to resetting state is to avoid creating it in the first place. One of the contributions of this work is demonstrating that the state of the cache prefetcher can be exploited for a covert channel. Neither the x86 architecture nor ARM processors support instructions that reset the prefetcher state. However, both architectures allow the system to disable prefetching, thereby eliminating the state (although at a significant performance cost).
When neither alternative exists, the operating system can attempt to programmatically reset the state of a component during a context switch. For example, the context switch code can fill the L1 cache with kernel data, ensuring that the contents of the cache do not depend on prior computation. The main problem with this approach is that the processor manufacturers only publish limited information about the internals of their processors. While several researchers have reverse-engineered some of these structures [36], [38], [43], [57] and the algorithms used [11], [22], [55], there is still a lack of concrete information about the implementations. Any defences that depend on such undocumented microarchitectural properties are inherently brittle and may fail to work as soon as a new version of the processor hits the market.
In this work we only employ defences available in the processors. More specifically, for Intel processors we use the WBINVD instruction to flush the caches, we update registers CR3 and CR0 to flush the TLB and we update MSR 0x1A4 to disable the prefetchers [52]. On ARM platforms, we use the hardware-provided operations to flush all levels of caches, branch target address cache, and TLBs. On ARM Cortex A9, we also disable prefetchers and the branch prediction unit.
Measuring the channel capacity
To check whether a channel is closed we configure the Trojan to send a pseudo-random sequence of input symbols. The spy collects the output symbols for each of the input symbols. A channel exists if the distribution of the output symbols depends on the input symbol.
The channel matrix. To measure the capacity of the channel, we use the technique of Cock et al. [14]. That is, we first create a channel matrix which specifies the probability of an observed output symbol given an input symbol. For example, Figure 2 shows the channel matrix we observe from our implementation of the L1 I-cache channel on an ARM Cortex A9 processor. As we can see, there is a strong correlation between the input symbol (number of cache sets that the Trojan occupies after each run) and the output symbol (total time to jump through every cache line in a cache-sized buffer). Consequently, there is a clear channel in this case. In the absence of a channel we would expect to see no horizontal variations in the graph.
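A channel matrix of this kind can be estimated from logged (input, output) pairs by counting. The sketch below is an illustration under simplifying assumptions, not the tooling of Cock et al. [14]: the function name and the sample data are made up, and output times are assumed to be already quantised into discrete symbols.

```python
from collections import Counter, defaultdict


def channel_matrix(samples):
    """Estimate P(output | input) from (input_symbol, output_symbol) pairs.

    Returns (inputs, outputs, matrix) with matrix[i][j] = P(outputs[j] | inputs[i]).
    """
    counts = defaultdict(Counter)
    for x, y in samples:
        counts[x][y] += 1
    inputs = sorted(counts)
    outputs = sorted({y for c in counts.values() for y in c})
    matrix = []
    for x in inputs:
        total = sum(counts[x].values())
        # Each row is the empirical output distribution for one input symbol.
        matrix.append([counts[x][y] / total for y in outputs])
    return inputs, outputs, matrix


# Made-up observations: input 0 mostly yields output 10, input 1 mostly 12.
samples = [(0, 10), (0, 10), (0, 11), (1, 11), (1, 12), (1, 12)]
inputs, outputs, matrix = channel_matrix(samples)
```

If all rows of the estimated matrix are (statistically) identical, the output distribution does not depend on the input and no channel is observed.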
Channel capacity. As a measure of the channel capacity, we calculate the Shannon capacity [48], denoted by C, from the channel matrix. This capacity indicates the average number of bits of information that a computationally unbounded receiver can learn from each input symbol. For the channel matrix in Figure 2 we find that C = 4.0 b. Because the Trojan sends one of 257 possible values (0, ..., 256), the maximum capacity expected is 8.0 bits per symbol. The observed capacity is smaller than the maximum due to the partial overlap between the distributions of the outputs for some input symbols, meaning that not all inputs can be unambiguously distinguished.
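One standard way to compute the Shannon capacity of a discrete channel matrix is the Blahut-Arimoto iteration. The sketch below is an assumption about how such a computation could look. The text does not state which numerical method is used, and the function name and parameters are hypothetical.

```python
import math


def shannon_capacity(P, iterations=2000, tol=1e-12):
    """Blahut-Arimoto estimate of channel capacity in bits per symbol.

    P[i][j] is the probability of output j given input i (rows sum to 1).
    """
    n_in, n_out = len(P), len(P[0])
    p = [1.0 / n_in] * n_in  # start from a uniform input distribution
    lower = 0.0
    for _ in range(iterations):
        # Output distribution induced by the current input distribution.
        q = [sum(p[i] * P[i][j] for i in range(n_in)) for j in range(n_out)]
        # d[i]: KL divergence (in nats) between row i and q.
        d = [sum(P[i][j] * math.log(P[i][j] / q[j])
                 for j in range(n_out) if P[i][j] > 0 and q[j] > 0)
             for i in range(n_in)]
        c = [math.exp(di) for di in d]
        z = sum(p[i] * c[i] for i in range(n_in))
        lower = math.log(z)       # lower bound on the capacity (nats)
        upper = math.log(max(c))  # upper bound on the capacity (nats)
        # Re-weight the inputs towards the more distinguishable ones.
        p = [p[i] * c[i] / z for i in range(n_in)]
        if upper - lower < tol:
            break
    return lower / math.log(2)


bsc = shannon_capacity([[0.9, 0.1], [0.1, 0.9]])   # binary symmetric channel
flat = shannon_capacity([[0.5, 0.5], [0.5, 0.5]])  # output independent of input
```

For a noiseless channel with 257 inputs the same routine would approach log2(257), roughly 8.0 bits, which is the maximum quoted above. Overlapping rows pull the result below that bound.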
Channel bandwidth. The channel bandwidth can now be calculated by multiplying the capacity by the input symbol rate, i.e. the frequency with which the Trojan can send different symbols. As the processor is timeshared between the Trojan and the spy, the Trojan can send at most one symbol per time slice. In the case of Figure 2 we use a time slice of 1 ms, meaning that per second the Trojan gets 500 opportunities to send a symbol, as the two processes run in alternating time slices. The potential bandwidth of the channel is therefore 500 · 4 = 2,000 b/s. Much higher bandwidths have been demonstrated in the literature [17], [23], [34], [37]. One reason for our relatively low bandwidth is that our channel is between programs that time-share a processor, whereas the other channels were observed between concurrently executing processes, allowing for a much higher symbol transfer rate. Furthermore, as indicated in subsection 4.1, we use an unsophisticated encoding, where the input is simply the number of cache sets accessed, resulting in 257 possible values. Instead we could map each bit of the input symbol's binary representation onto a cache set, resulting in 2^256 values and a theoretical maximum capacity of 256 bits per symbol (although the actually achievable capacity would be much smaller, given the spread of the distribution). Instead of aiming for high capacity, our channels are designed to facilitate analysis.
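The bandwidth arithmetic above can be reproduced with a few lines. This is a minimal sketch: the helper name `bandwidth_bits_per_second` and the `parties` parameter are hypothetical, and the model assumes strict alternation between the Trojan and the spy.

```python
import math


def bandwidth_bits_per_second(capacity_bits, time_slice_s, parties=2):
    """Capacity per symbol times the symbol rate under alternating time slices."""
    # The Trojan only runs in one out of every `parties` time slices.
    symbols_per_second = 1.0 / (time_slice_s * parties)
    return capacity_bits * symbols_per_second


# Figures quoted in the text: C = 4.0 b per symbol, 1 ms time slice.
bandwidth = bandwidth_bits_per_second(4.0, 1e-3)

# Upper bounds on the capacity per symbol for the two encodings discussed.
count_encoding_max = math.log2(257)        # number of cache sets, 0..256
bitmap_encoding_max = math.log2(2 ** 256)  # one bit per cache set
```

With these inputs the bandwidth evaluates to about 2,000 b/s, and the two encoding bounds to about 8.0 and exactly 256 bits per symbol respectively.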
Low-capacity channels. A challenge for our channel capacity measurement is that it is based on a sampled distribution. Sampling errors may result in an apparent non-zero channel, even when in reality there is no channel. Consequently, when the computed channel capacity C is low, we need to determine whether we observe a low-capacity channel or just sampling noise. For example, invalidating the cache during a context switch significantly reduces the capacity of the L1-Instruction channel. The resulting channel matrix (Figure 3) appears almost uniform and the calculated channel capacity is C = 0.70 b. We now need to distinguish between the null hypothesis that states that there is no channel, i.e. the distribution of output symbols is independent of the input symbols, and the alternative hypothesis that there is a channel. If we assume the null hypothesis, then all of the output symbols we collect are drawn from a single distribution (i.e. no dependence on the input symbol). In such a case, the way these output symbols are distributed among the input symbols is just the luck of the draw. We are now interested in estimating the likelihood that drawing a random sample out of this distribution would yield a channel capacity C or higher. Following Cock et al. [14], we randomly redistribute the collected output symbols among the input symbols and measure the capacity of this simulated channel. We repeat the process 1000 times, generating 1000 simulated capacities. If our measured sample (Figure 3) is drawn from a single distribution, there is a high likelihood that some of the simulated capacities would be higher than C. Thus, if C is bigger than the maximum of the 1000 simulated capacities, C_max, the probability that the sample is drawn from a single distribution is less than 0.1% and we reject the null hypothesis. Otherwise, we conclude that we cannot reject the null hypothesis and that the test is inconclusive (i.e. consistent with no channel).
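The shuffle test can be sketched as follows. To keep the example short, it uses the empirical mutual information between inputs and outputs as a stand-in for the Shannon capacity. This substitution is an assumption made for illustration only, as are the function names and the fixed random seed.

```python
import math
import random
from collections import Counter


def mutual_information(inputs, outputs):
    """Empirical mutual information I(X;Y) in bits between two symbol lists."""
    n = len(inputs)
    px = Counter(inputs)
    py = Counter(outputs)
    pxy = Counter(zip(inputs, outputs))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def shuffle_test(inputs, outputs, rounds=1000, seed=0):
    """Return (measured, max_simulated, reject_null)."""
    rng = random.Random(seed)
    measured = mutual_information(inputs, outputs)
    shuffled = list(outputs)
    max_simulated = 0.0
    for _ in range(rounds):
        # Shuffling the outputs destroys any dependence on the inputs.
        rng.shuffle(shuffled)
        max_simulated = max(max_simulated,
                            mutual_information(inputs, shuffled))
    # Reject "no channel" only if the measurement beats every simulation.
    return measured, max_simulated, measured > max_simulated


# Hypothetical data: the output is fully determined by the input.
inputs = [0, 1] * 200
outputs = [10 if x == 0 else 20 for x in inputs]
measured, max_simulated, reject_null = shuffle_test(inputs, outputs)
```

In this toy case the measured value is far above every simulated one, so the null hypothesis is rejected. With an output that never changes, both values are zero and the test remains inconclusive.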
Visualising low-capacity channels. Visualising the channel matrix as a heat map is useful for presenting the wealth of information that the matrix contains. However, for low-capacity channels, such a representation may fail to highlight the channel. When the matrix does not clearly show the channel, we graph the average value of the output symbols for each input symbol. Figure 4 shows this graph for the channel matrix of Figure 3. As we can see, there is a clear difference between the average output symbol for small vs. large input values.
Figure 4. Average probe times, with 95% confidence intervals, for the L1-I covert channel on ARM Cortex A9 with cache invalidation. n = 3823, C = 0.70, C_max = 0.38.
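A graph of this kind needs, for each input symbol, the mean output and a confidence interval. The sketch below assumes a normal-approximation 95% interval (mean ± 1.96 standard errors). The text does not state how the intervals are computed, and the function name and toy data are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def mean_probe_times(samples, z=1.96):
    """Map each input symbol to (mean, half_width) of an approximate 95% CI."""
    by_input = defaultdict(list)
    for inp, out in samples:
        by_input[inp].append(out)
    result = {}
    for inp, outs in sorted(by_input.items()):
        mean = statistics.fmean(outs)
        if len(outs) > 1:
            # Standard error of the mean from the sample standard deviation.
            sem = statistics.stdev(outs) / math.sqrt(len(outs))
        else:
            sem = float("nan")  # undefined for a single sample
        result[inp] = (mean, z * sem)
    return result


# Hypothetical toy data: (input symbol, probe time).
samples = [(0, 10), (0, 12), (0, 14), (1, 20), (1, 20)]
averages = mean_probe_times(samples)
```

Plotting the means against the input symbols, with the half-widths as error bars, makes a shift between small and large inputs visible even when the heat map looks uniform.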
Results

Evaluation platforms
We examine processors from the two most widely used architectures, x86 and ARM. On each we pick representatives of two recent generations of the architecture. For x86 these are the Sandy Bridge and Haswell microarchitectures. For ARM we use a Cortex-A9, which is an out-of-order (OoO) implementation of version 7 of the architecture (ARMv7), and a Cortex-A53, an in-order (InO) implementation of ARMv8, the latest version of the architecture. We summarise their relevant features in Table 1. Note that the table does not contain information on the BPUs, as there is limited information available and we do not rely on any assumptions on the way they operate.
On each architecture we test each applicable channel with and without the countermeasures. Table 2 shows the unmitigated channel capacities. We see that most of the channels leak at least one bit per input symbol. This is to be expected: all of these channels have been reported and demonstrated in the past. More surprisingly, we find that many of the channels are not completely closed even if we apply all of the available countermeasures, as shown in Table 3. In fact, for each processor we find at least one channel that leaks over half a bit per symbol.
Table 3. Mitigated channel capacities and maximum simulated capacities (bits). Channels that cannot be mitigated with all hardware-provided operations are marked in red.
In some cases the explanation for the remaining channels is straightforward. For example, Intel architectures do not support any method of clearing the state of the BPU. Consequently, the branch prediction channel remains unchanged even when we enable all of the protection provided by the processor. In other cases the story is more intricate. We now look at some examples.
ARM Cortex A9 I-cache channel
We now continue the investigation of the ARM Cortex A9 L1 I-cache channel that we started in subsection 4.3. Recall that Figure 4 shows that even when we flush the caches, we can see a horizontal transition between two distinct distributions, indicating that a channel exists.
In the ARM Cortex A9, the distance between two addresses that map to the same cache set, known as the cache stride (and equal to the cache size divided by associativity), is 8 KiB. The transition in Figure 4, however, occurs at 4 KiB, which matches the page size rather than the cache stride. This may indicate that the channel originates not from the cache itself, but from some part of the virtual memory unit, for example, from the TLB. Hence, clearing the TLB can, potentially, eliminate this channel.
When applying all of the countermeasures available on the ARM Cortex A9 processor, including flushing the caches, BTB and TLB and disabling the prefetcher, we get the channel matrix in Figure 5. The channel is still significant, as is clearly evident from Figure 6. Input values smaller than 14 result in below-average output symbols, whereas input values in the range 15-50 produce above-average output. While the channel matrix demonstrates the existence of a channel, it does not show the cause of the channel.
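The stride calculation is simple enough to check directly. The sketch below assumes a 32 KiB, 4-way set-associative L1 I-cache, which is consistent with the 8 KiB stride quoted above. Those two figures are assumptions for the example and are not stated in this section.

```python
KIB = 1024


def cache_stride(cache_size_bytes, associativity):
    """Distance in bytes between addresses that map to the same cache set."""
    return cache_size_bytes // associativity


# Assumed geometry: 32 KiB, 4-way L1 I-cache (not stated in this section).
stride = cache_stride(32 * KIB, 4)
page_size = 4 * KIB  # page size quoted in the text

# A cache effect would be expected to repeat at the stride, not the page size.
transition_matches_stride = (page_size == stride)
```

Under these assumptions the stride is 8 KiB while the observed transition sits at 4 KiB, which is why the text looks to the virtual memory unit for an explanation.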
One possible explanation for the channel is that the processor maintains some state that is not cleared by flushing the caches and is not deactivated when disabling the prefetcher. An alternative explanation is that the state is maintained outside the processor, for example resulting from a DRAM open-row channel [44] .
To investigate further, we look at different indicators: We use the performance monitoring unit (PMU) to count the L1 I-cache refill operations executed during the probe. As we have the prefetcher disabled, these refills should be independent of timing. 
Sandy Bridge L1 I-cache channel
Without countermeasures, the channel that exploits this cache behaves as expected, as is evident from Figure 8 . Because this is a cache channel, we can reasonably expect that invalidating the caches on the context switch would eliminate the channel. As in the case of the ARM Cortex A9 processor, the channel matrix with this countermeasure (Figure 9 ) shows a much reduced channel. However, as on the ARM Cortex A9, the channel matrix still shows horizontal variation, indicating a small but definite remaining channel. We further evaluate the channel when flushing the TLB and disabling the prefetcher, but a distinct channel still remains, as shown in Figure 10 . 
The effects of the Haswell prefetcher
On the Haswell microarchitecture, the L1 I-cache channel is comparatively small (Figure 11). Surprisingly, disabling the prefetcher increases the capacity of the channel (Figure 12). It seems that between Sandy Bridge and Haswell, Intel modified the prefetcher and possibly the branch predictor, leading to better masking of L1-cache latencies, thus reducing the effectiveness of the attack. Enabling all mitigations fails to close the channel, but still decreases it, from a capacity of 0.65 b to 0.25 b. The channel is clearly evident in Figure 13.
Branch prediction channel
We now turn our attention to the branch prediction channel. Recall that in the channel implementation (subsection 4.1), there are only two potential input values, 0 and 1, corresponding to a branch taken and not taken. Figure 14 shows the distribution of output values for each of these input values, both without and with mitigations. We first note that in both cases the distributions of the output symbols for inputs 0 and 1 are clearly distinct. For the non-mitigated case, the median output value for input 0 is 36 cycles whereas the median output for input 1 is 60. Mitigation changes the access times because code now needs to be brought from memory, rather than from the L1 I-cache. However, the output values for inputs 0 and 1 are even
Like the Sandy Bridge processor, the Haswell processor (Figure 15) shows clearly distinct output distributions for the different input symbols. Unlike the Sandy Bridge processor, on Haswell we do not see such a large difference between the output values for the mitigated and the non-mitigated cases. We believe that the Haswell processor prefetches soon-to-be-executed instructions even when prefetching is disabled.
TLB channel
The last channel we investigate uses the TLB. On the Sandy Bridge architecture, Figure 16 shows a very distinct channel, despite the non-global TLB entries being flushed on the context switch due to updating the CR3 register (we run in 32-bit mode where the TLB is untagged and flushed by the hardware on each context switch). This is in contrast to the common belief that the TLB channel is not a threat to virtualised environments because of the mandatory flush [60] . For good measure, we explicitly flush global entries as well, but the effect is minimal as shown in Figure 17 . Surprisingly, invalidating the cache does remove most of the channel, leaving only a small residual channel, as shown in Figure 18 .
Discussion
As we can see from the results, deploying all of the available methods of processor state sanitisation still leaves high-capacity channels. The countermeasures we deployed are often suggested for mitigating the exact channels we use [21], [40], [60]. Yet, contrary to popular belief, we find that despite some being prohibitively inefficient [21], [24], these countermeasures fail to eliminate the channels. We further find that none of the channels is completely eliminated even when we deploy all of the available countermeasures. The capacity of the residual channels may be small, but they still exist.
As an example, the capacity of the residual Intel Sandy Bridge L1-D channel (Figure 19) is 0.038 b, with a potential error of up to 0.025 b. That means that on average, a computationally unbounded adversary will require between 26 and 40 input symbols to transfer a single bit. With a transfer rate of 500 symbols per second, the bandwidth of the channel is at most 19 bits per second.
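The figures in this example follow from taking reciprocals of the capacity. The sketch below reads the 0.025 figure as the lower end of the capacity range, which is what the quoted range of 26 to 40 symbols suggests. That reading, and the helper name, are assumptions.

```python
def symbols_per_bit(capacity_bits_per_symbol):
    """Average number of input symbols needed to transfer one bit."""
    return 1.0 / capacity_bits_per_symbol


measured_capacity = 0.038  # bits per symbol, from the text
lower_capacity = 0.025     # assumed lower end of the range (see lead-in)
symbol_rate = 500          # symbols per second, from the text

fewest_symbols = symbols_per_bit(measured_capacity)  # about 26
most_symbols = symbols_per_bit(lower_capacity)       # about 40
max_bandwidth = measured_capacity * symbol_rate      # about 19 b/s
```

Even at the lower end of that range, the channel would carry roughly 12.5 bits per second at this symbol rate.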
While such a capacity may seem small and insignificant, we note that, as we indicated earlier, we did not build the channel to achieve high capacity. Consequently, further engineering is likely to better exploit the underlying source of leakage and achieve a much higher capacity. Moreover, for high-security systems, even channels with capacities below one bit per second may pose a threat. For example, the Orange Book [15] recommends that channels with a bandwidth above 0.1 bits per second are audited.
The main issue with these low-capacity residual channels is that we do not understand them. Evidently, there is some state within the processor, but we do not know what this state is and how to manage it. Consequently, there is a real possibility that better understanding of this state will enable higher-bandwidth exploits. The only way to rule out such a possibility is through understanding the root cause of the channel.
Conclusions
In this work we investigate intra-core covert channels in modern CPUs. We implemented five different covert channels and measured their capacity on two microarchitecture implementations of each of the two most popular ISAs, x86 and ARM. We identified processor tools to mitigate these covert channels, but demonstrated that these tools are not sufficient. We find that high-capacity channels remain in every architecture, even when implementing the most drastic (and expensive) countermeasures.
It goes without saying that even if we were able to fully close a channel, this would not guarantee that there is no other hardware state that could be exploited, or that more sophisticated exploits of the same state would not succeed.
We therefore have to conclude that, in the absence of improved architectural support for covert channel mitigation, these modern processors are not suitable for security-critical uses where a processor core is time-multiplexed between different security domains. This work only explores the tip of the iceberg. We have limited ourselves to intra-core channels in a timesharing scenario. In doing that we ignored all transient-state covert-channel attacks and all attacks that rely on state outside the processor.
The inevitable conclusion is that security is a losing game until the hardware manufacturers get serious about it and provide the right mechanisms for securely managing shared processor state. This will require additions to the ISA that allow any shared state to be either partitioned or flushed.
