Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern from an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus.
Introduction
As cloud computing becomes increasingly popular, privacy of users' sensitive data is a huge concern in computation outsourcing. In an ideal setting, users would like to "throw their encrypted data over the wall" to a cloud service that can perform arbitrary computation on that data, yet learn no secret information from within that data.
One candidate solution for secure cloud computing is to use tamper-resistant/secure processors. In this setting, the user sends his/her encrypted data to trusted hardware, inside which the data is decrypted and computed upon. After the computation finishes, the final results are encrypted and sent back to the user. Many such hardware platforms have been proposed, including Intel's TXT [11] (which is based on the TPM [34, 1]), eXecute Only Memory (XOM) [18], Aegis [33] and Ascend [7, 39].
978-1-4799-3097-5/14/$31.00 ©2014 IEEE
While it is assumed that adversaries cannot look inside tamper-resistant hardware, secure processors can still leak information through side channels. Preventing information leakage over the memory I/O channel, for example, is a hard problem. Even if all data stored in external memory is encrypted to hide data values, the memory access pattern (i.e., read/write/address tuples) can still leak information [4] ].
Completely preventing access pattern leakage requires the use of Oblivious RAM (ORAM). ORAMs were first proposed by Goldreich and Ostrovsky [9], and there has been significant follow-up work that has resulted in more efficient, cryptographically-secure ORAM schemes [24, 23, 5, 4, 10, 16, 37, 29, 32]. Conceptually, ORAM works by maintaining all of memory in encrypted and shuffled form. On each access, memory is read and then reshuffled. Thus, any main memory access pattern is computationally indistinguishable from any other access pattern of the same length.
Recently, ORAM has been embraced in secure processor designs [7, 26, 19]. These proposals replace a conventional DRAM controller with a functionally-equivalent ORAM controller that makes ORAM requests on last-level cache (LLC) misses. This direction is promising, and allows large secure computations (whose working sets do not fit in on-chip cache) to be performed on the cloud with reasonable overheads.
Problems
1.1.1. When ORAM is accessed leaks privacy. Consider the malicious program that runs in an ORAM-enabled secure processor in Figure 1 (a). In this example, at every time step t that an ORAM access can be made, the malicious program is able to leak the t-th secret bit by coercing a Last Level Cache (LLC) miss if the t-th secret bit equals 1.
Even when the program is not intentionally malicious, however, the nature of on-chip processor caches makes the ORAM access rate correlate with access pattern locality, which is data-dependent to some extent. A serious security issue is that we don't know how much leakage is possible with these programs. If we place security as a first-order constraint, we have to assume the worst case (e.g., Figure 1).
An important point is that monitoring ORAM rate assumes a similarly-capable adversary relative to prior works, i.e., one who can monitor a secure processor's access pattern can usually monitor its timing. We describe a mechanism to measure a recent ORAM scheme's timing in § 3.2.
In Figure 2 (top), perlbench accesses ORAM 80 times more frequently on one input relative to another. In Figure 2 (bottom), for one input to astar a single rate is sufficient whereas the rate for the second input changes dramatically as the program runs.
Our Solution: Leakage Aware Processors
Given that complete timing channel protection is prohibitively expensive (§ 1.1.2) yet no protection has unknown security implications (§ 1.1.1), this paper makes two key contributions. First, we develop an architecture that, instead of blocking information leakage completely, limits leakage to a small and controllable constant. We denote this constant L (where L ≥ 0), for bit leakage Limit. Second, we develop a framework whereby, by increasing/decreasing L, a secure processor can make more/fewer user data-dependent performance/power optimizations. In short, we propose mechanisms that allow a secure processor to trade off information leakage and program efficiency in a provably secure and disciplined way.
L can be interpreted in several ways based on the literature. One definition that we will use in this paper is akin to deterministic channels [31]: "given an L-bit leakage limit, an adversary with perfect monitoring capabilities can learn no more than L bits of the user's input data with probability 1, over the course of the program's execution." Crucially, this definition makes no assumption about which program is run on the user's data. It also makes no assumption about which L bits in the user's input are leaked.
Importantly, L should be compared to the size of the encrypted program input provided by the user (typically a few kilobytes) and not the size of the secret symmetric key (or session key) used to encrypt the data. As with other secure processor proposals (e.g., Ascend or Aegis [33]), we assume all data that leaves the processor is encrypted with a session key that is not accessible to the program (§ 5). This makes the ORAM access rate (for any program and input data) independent of the session key.
To simplify the presentation, we assume a given processor is manufactured with a fixed L. We discuss (along with other bit leakage subtleties) how users can set L per-session in § 10.
Calculating & Bounding Bit Leakage
Figure 1 (a), which shows P1 leaking T bits in T time. On the other hand, accessing ORAM at an offline-selected, periodic rate (§ 1.1.2) yields exactly 1 distinct trace after being run for T time. As expected, this scheme leaks lg 1 = 0 bits through the ORAM timing channel.
As architects, we can use this leakage measure in several ways. First, we can track the number of traces using hardware mechanisms, and (for example) shut down the chip if leakage exceeds L before the program terminates. Second (our focus in this paper), we can re-engineer the processor such that the leakage approaches L asymptotically over time.
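To make the leakage measure concrete, the following minimal sketch counts distinct timing traces and takes the base-2 logarithm of that count. It is an illustration under our own assumptions: a trace is modeled as a tuple of 0/1 flags (one per time step), and the function names `leakage_bits`, `malicious_traces` and `periodic_traces` are hypothetical rather than part of any design described here.

```python
import math
from itertools import product


def leakage_bits(num_traces: int) -> float:
    """Bit-leakage bound: lg of the number of distinct observable traces."""
    if num_traces < 1:
        raise ValueError("a program produces at least one trace")
    return math.log2(num_traces)


def malicious_traces(T: int) -> set:
    """Traces of a program that accesses ORAM at step t iff secret bit t is 1.

    Every T-bit secret yields a different trace, so there are 2**T traces.
    """
    return set(product((0, 1), repeat=T))


def periodic_traces(T: int, period: int) -> set:
    """Traces when ORAM is accessed at a fixed period, whatever the secret is."""
    trace = tuple(1 if t % period == 0 else 0 for t in range(T))
    # The trace does not depend on the secret, so the set has one element.
    return {trace for _secret in product((0, 1), repeat=T)}


if __name__ == "__main__":
    T = 8
    print(leakage_bits(len(malicious_traces(T))))    # T bits in T time steps
    print(leakage_bits(len(periodic_traces(T, 4))))  # a single trace: 0 bits
```

The same counting argument applies to any scheme: bounding the number of traces a processor can ever produce bounds the number of bits it can leak.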
An Overview of Our Proposal
At the implementation level, our proposal can be viewed as two parts. First, using ideas similar to those presented in [2] and [20], we split program execution into coarse-grain time epochs. Second, we architect the secure processor with learning mechanisms that choose a new ORAM rate out of a set of allowed rates only at the end of each epoch. Figure 3 illustrates the idea. We denote the list of epochs, or the epoch schedule (§ 6), as E; the list of allowed ORAM rates is denoted R. Each epoch is denoted by its length, in cycles. To choose a good ORAM rate in R for each epoch, we architect a learning module (§ 7) inside the secure processor called a rate learner (see Figure 3). Rate learners are circuits that, while the program is running, determine how well (in terms of power/performance) the system would perform if it ran using different ORAM rates. In Figure 3, the rate learner is able to decide that in Epoch 0, the program is accessing ORAM too frequently (i.e., we are wasting energy). Correspondingly, we slow down the rate in Epoch 1.
Background: Path ORAM
To provide background we summarize Path ORAM [32], a recent Oblivious-RAM (ORAM) scheme that has been built into secure processors. For more details, see [26, 19].
We will assume this scheme for the rest of the paper, but point out that our dynamic scheme can be applied to other ORAM protocols.
Path ORAM consists of an on-chip ORAM controller and untrusted external memory (we assume DRAM). The ORAM controller exposes a cache line request/response interface to the processor like an on-chip DRAM controller. Invisible to the rest of the processor, the ORAM controller manages external memory as a binary tree data structure. Each tree node (a bucket) stores up to a fixed number (set at program start time) of blocks and is stored at a fixed location in DRAM. In our setting, each block is a cache line. Each bucket is encrypted with probabilistic encryption and is padded with dummy blocks to the maximum/fixed bucket size.
At any time, each block stored in the ORAM is mapped (at random) to one of the leaves in the ORAM tree. This mapping is maintained in a key-value memory that is internal to the ORAM controller. Path ORAM's invariant is: If block d is mapped to leaf l, then d is stored on the path from the root of the ORAM tree to leaf l (i.e., in one of a sequence of buckets in external memory).
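The invariant can be illustrated with a small sketch that lists which buckets lie on the path to a leaf and keeps a random block-to-leaf mapping. The heap-style bucket numbering (root at index 0, children of bucket i at 2i+1 and 2i+2) and the names `path_buckets` and `PositionMap` are assumptions made for this example; the text above only requires that each bucket sit at a fixed location in DRAM.

```python
import secrets


def path_buckets(leaf: int, depth: int) -> list:
    """Bucket indices on the path from the root to `leaf`, root first.

    Assumes a complete binary tree with 2**depth leaves stored heap-style.
    """
    if not 0 <= leaf < (1 << depth):
        raise ValueError("leaf out of range")
    # At a given level the first bucket has index 2**level - 1, and the
    # ancestor of `leaf` is selected by the top `level` bits of the leaf label.
    return [(1 << level) - 1 + (leaf >> (depth - level))
            for level in range(depth + 1)]


class PositionMap:
    """Key-value memory mapping each block to a uniformly random leaf."""

    def __init__(self, depth: int):
        self.depth = depth
        self.leaf_of = {}

    def lookup(self, block_id: int) -> int:
        # Blocks that have not been seen yet are assigned a fresh random leaf.
        if block_id not in self.leaf_of:
            self.leaf_of[block_id] = secrets.randbelow(1 << self.depth)
        return self.leaf_of[block_id]


if __name__ == "__main__":
    pos = PositionMap(depth=3)
    leaf = pos.lookup(block_id=42)
    # Under the invariant, block 42 must be stored in one of these buckets.
    print(leaf, path_buckets(leaf, pos.depth))
```

In this layout a path touches depth + 1 buckets regardless of which leaf is chosen, so every access to external memory has the same shape.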
ORAM Accesses
The ORAM controller is invoked on LLC misses and evictions. We describe the operation to service a miss here.
On an LLC miss, the ORAM controller reads (+ decrypts) 
Measuring Path ORAM Timing
If the adversary and secure processor share main memory (e.g., a DRAM DIMM), a straightforward way to measure ORAM access frequency is for the adversary to measure its own average DRAM access latency (e.g., use performance counters to measure resource contention [20, 35]).
Even in the absence of data-dependent contention and counters, however, the adversary can accurately determine 
This attack assumes that the secure processor's main memory can be remotely read (i.e., through software) by an adversary, which we believe is a realistic assumption. Much focus is given to protecting physical DRAM pages in [11] (i.e., in the presence of DMA-capable devices, GPUs, etc. that share DRAM DIMMs). This indicates that completely isolating DRAM from malicious software is a challenging problem in itself. For example, bugs in these protection mechanisms have resulted in malicious code performing DMAs on privileged memory [38]. Of course, an insider that is in physical proximity can measure access times more precisely using probes.
Threat Model
Our goal is to ensure data privacy while a program is running on that data in a server-controlled (i.e., remote) secure processor. The secure processor (hardware) is assumed to be trusted. The server that controls the processor is assumed to be curious and malicious. That is, the server wants to learn as much as possible about the data and will interfere with computation if it can learn more about the user's data by doing so.
Secure Processor Assumptions
The secure processor runs a potentially malicious/buggy program, provided by the server or user, on the user's data. The secure processor is allowed to share external resources (e.g., the front-side bus, DRAM DIMMs) with other processors/peripherals. As with prior ORAM work [7, 26, 19], we assume that the secure processor runs a program for one user at a time and that the processor performs ORAM accesses (§ 3.1) in place of DRAM accesses on LLC misses/evictions. Adversaries that monitor shared on-chip resources (e.g., pipeline [6], cache [35]) are out of our scope. We give insight as to how to extend our scheme to cache timing attacks in § 10.
ORAM ensures that all data sent on/off chip is automatically encrypted with a symmetric session key. We assume that this key cannot be accessed directly by the program. For timing protection, we additionally require that all encryption routines are fixed latency.
Monitoring The Secure Processor
The server can monitor the processor's I/O pins, or any external state modified through use of the I/O pins (i.e., using techniques from § 3.2). I/O pins contain information about (a) when the program is loaded onto the processor and eventually terminates, (b) the addresses sent to the main memory and data read from/written to main memory, and (c) when each memory access is made. For this paper, we will focus on (a) and (c): we wish to quantify ORAM timing channel leakage and how a program's termination time impacts that leakage. We remark that ORAM, without timing protection, was designed to handle (b).
Malicious Server Behavior
We allow the server to interact with the secure processor in unintended ways in order to learn more about the user's data. In particular, we let the server send wrong programs to the secure processor and perform replay attacks (i.e., run programs on the user's data multiple times). Our L-bit leakage scheme (without protection) is susceptible to replay attacks: if the server can learn L bits per program execution, N replays will allow the server to learn L * N bits. We introduce schemes to prevent this in § 8. Finally, we do not add mechanisms to detect when/if an adversary tampers with the contents of the DRAM (e.g., flips bits) that stores the ORAM tree (§ 3). This issue is addressed for Path ORAM in [25].
Attacks Not Prevented
We only limit leakage over the digital I/O pins and any resulting modified memory. We do not protect against physical/hardware attacks (e.g., fault, invasive, EM, RF). An important difference between these and the ORAM timing channel is that the ORAM channel can be monitored through software, whereas physical and invasive attacks require special equipment. For this same reason, physical/hardware attacks are not covered by Intel TXT [11].
We now describe an example protocol for how a user would interact with a server. We refer to the program run on the user's data as P, which can be public or private. Before we begin, we must introduce the notion of a maximum program runtime, denoted Tmax. Tmax is needed to calculate leakage only, and should be set such that all programs can run in < Tmax cycles (e.g., we use Tmax = 2^62 cycles at 1 GHz, or ≈ 150 years). The protocol is then given by:
1. The user and secure processor negotiate a symmetric session key K. This can be accomplished using a conventional public-key infrastructure.
2. The user sends encryptK(D) to the server, which is forwarded to the processor. encryptK(D) means "D encrypted under K using symmetric, probabilistic encryption". Finally, the server sends P and leakage parameters (e.g., R; see § 2.2) to the processor.
3. Program execution (§ 6-7). The processor decrypts encryptK(D), initializes ORAM with P and D (as in [7, 26]) and runs for up to Tmax cycles. During this time, the processor can dynamically change the ORAM rate based on E and R (§ 2.2).
4. When the program terminates (i.e., before Tmax), the processor encrypts the final program return value(s) encryptK(P(D)) and sends this result back to the user.
Epoch Schedules and Leakage Goals
A zero-leakage secure processor architecture must, to fully obfuscate the true termination time of the program, run every program to Tmax cycles. In contrast, the protocol in § 5 has the key property that results are sent back to the user as soon as the program terminates instead of waiting for Tmax. We believe this early termination property, that a program's observable runtime reflects its actual runtime, is a requirement in any proposal that claims to be efficient and of practical usage.
The negative side effect of early termination is that it can leak bits about the private user input just like the ORAM timing channel. If we consider termination time alone, program execution can yield Tmax timing traces (i.e., one for each termination time). Further applying the theoretic argument from § 2.1, at most lg Tmax bits about the inputs can leak through the termination time per execution.
In practice, due to the logarithmic dependence on Tmax, termination time leakage is small. As we discussed in § 5, lg Tmax = 62 should work for all programs, which is very small if the user's input is at least a few kilobytes. Further, we can reduce this leakage through discretizing runtime. (E.g., if we "round up" the termination time to the next 2^30 cycles, the leakage is reduced to lg 2^(62-30) = 32 bits.)
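The arithmetic behind these numbers can be checked with a short sketch. The helper names `termination_leakage_bits` and `cycles_to_years`, and the use of a 365.25-day year, are our own choices for illustration.

```python
def termination_leakage_bits(lg_tmax: int, lg_round: int = 0) -> int:
    """Bits leaked by the termination time alone.

    lg_tmax:  lg of the maximum runtime in cycles (Tmax = 2**lg_tmax).
    lg_round: termination time is rounded up to a multiple of 2**lg_round.
    """
    if not 0 <= lg_round <= lg_tmax:
        raise ValueError("need 0 <= lg_round <= lg_tmax")
    # 2**lg_tmax possible termination times collapse into this many
    # distinguishable values once runtime is discretized.
    observable = 1 << (lg_tmax - lg_round)
    return observable.bit_length() - 1  # exact lg of a power of two


def cycles_to_years(cycles: int, hz: float = 1e9) -> float:
    """Wall-clock years taken by `cycles` at clock frequency `hz`."""
    seconds_per_year = 365.25 * 24 * 3600
    return cycles / hz / seconds_per_year


if __name__ == "__main__":
    print(termination_leakage_bits(62))      # no rounding
    print(termination_leakage_bits(62, 30))  # round up to 2**30 cycles
    print(round(cycles_to_years(1 << 62)))   # about a century and a half
```

Rounding to coarser boundaries trades a delay of up to 2^30 cycles in returning the result for fewer distinguishable termination times.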
Since programs leak ≤ lg Tmax bits through early termination (§ 6), we will restrict our schemes to leak at most that order (O(lg Tmax)) of bits through the ORAM access timing channel. To obtain this leakage, we split program runtime into at most lg Tmax epochs.
The initial epoch should be large enough so that the rate learner has enough time to determine the next rate (§ 7). A larger initial epoch also means fewer epochs total, reducing leakage. The initial epoch should be small enough to not dominate total runtime; for the workloads we evaluate in § 9, 2^30 cycles represents a small fraction of execution time. During the initial epoch, the ORAM rate can be set to any (e.g., a random) value.
Rate Learners
We now explain how rate learners select new ORAM rates in R at the end of each epoch, and how these learners are built into hardware.
Performance Counters and Rate Prediction
Our rate learner is made up of three components: performance counters, a mechanism that uses the performance counters to predict the next ORAM rate and a discretization circuit that maps the prediction to a value in R. We show what data the performance counters track in Figure 4. Req 1 illustrates an overset rate, meaning that we are waiting too long to make the next access. Recall our notation from § 2.1: an ORAM rate of r cycles means the next ORAM access happens r cycles after the last access completes. If the rate is overset, Waste can increase per access by ≤ r. In our evaluation, ORAM latency is 1488 cycles and rates in R range from 256 to 32768 cycles (§ 9.2). Thus, oversetting the rate can lead to a much higher performance overhead than ORAM itself.
Req 2 illustrates an underset rate, meaning that ORAM is being accessed too quickly. When the rate is underset, the processor generates LLC misses when a dummy ORAM request is outstanding (forcing the processor to wait until that dummy access completes to serve the miss). This case is a problem for memory bound workloads where performance is most sensitive to the rate (§ 9.2).
Req 3 illustrates how multiple outstanding LLC misses are accounted for. In that case, a system without timing protection should perform ORAM accesses back to back until all requests are serviced. To model this behavior, we add the rate's cycle value to Waste.
(1) where EpochCycles denotes the number of cycles in the last epoch. Conceptually, NewIntRaw represents the offered load rate on the ORAM. Note that since ORAMCycles is the sum of access latencies, this algorithm does not assume that ORAM has a fixed access latency.
7.1.3. Rate discretization. Once the rate predictor calculates NewIntRaw, that value is mapped to whichever element in R is closest: i.e., NewInt = argmin_{r ∈ R} |NewIntRaw − r|. As we show in § 9.5, |R| can be small (4 to 16); therefore this operation can be implemented as a sequential loop in hardware.
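A minimal software sketch of prediction followed by discretization is given below. Two things in it are assumptions rather than statements of the design: first, the raw prediction is taken to be the epoch's cycles, net of Waste and ORAMCycles, divided by AccessCount; second, the candidate set `EXAMPLE_RATES` is hypothetical, using the 256 and 32768 cycle endpoints and the 1290 cycle rate mentioned in the evaluation plus one illustrative intermediate value. The function names are ours.

```python
# Hypothetical candidate set R; 6501 is an illustrative filler value.
EXAMPLE_RATES = (256, 1290, 6501, 32768)


def predict_raw_rate(epoch_cycles: int, waste: int,
                     oram_cycles: int, access_count: int) -> float:
    """Assumed predictor: useful cycles available per real ORAM access."""
    if access_count <= 0:
        raise ValueError("need at least one access to predict a rate")
    return (epoch_cycles - waste - oram_cycles) / access_count


def discretize(new_int_raw: float, rates=EXAMPLE_RATES) -> int:
    """Map the raw prediction to the closest element of R.

    Written as a sequential scan, mirroring a loop over a small |R|.
    """
    best = rates[0]
    for r in rates[1:]:
        if abs(new_int_raw - r) < abs(new_int_raw - best):
            best = r
    return best


if __name__ == "__main__":
    raw = predict_raw_rate(epoch_cycles=1_000_000, waste=100_000,
                           oram_cycles=400_000, access_count=500)
    print(raw, discretize(raw))
```

Because the scan keeps the first candidate on ties, the result is deterministic for a fixed ordering of R, which matters when the same inputs must always yield the same rate.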
7.2 Hardware Cost and Optimizations
As described, the learner costs an adder and a divider. Since this operation occurs only once per epoch (where epochs are typically billions of cycles each, see § 6.2), it is reasonable to use a processor's divide unit to implement the division operation. To make the rate matcher self-contained, however, we round AccessCount up to the next power of two (including the case when AccessCount is already a power of 2) and implement the division operation using 1-bit shift registers (see Algorithm 1). This optimization may underset the rate by as much as a factor of two (due to rounding), which we discuss further in § 7.3. In the worst case, this operation may take as many cycles as the bitwidth of AccessCount, which we can tolerate by starting the epoch update operation at least that many cycles before the epoch transition.
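The shift-based division can be sketched in software as follows. This is our reading of the description above, not a transcription of Algorithm 1: the numerator and the access count are shifted right together, one bit per iteration, until the count reaches zero, which divides by the smallest power of two strictly greater than the access count. The name `shift_divide` is hypothetical.

```python
def shift_divide(numerator: int, access_count: int) -> int:
    """Approximate numerator / access_count using only 1-bit right shifts."""
    if numerator < 0 or access_count <= 0:
        raise ValueError("expected numerator >= 0 and access_count > 0")
    result = numerator
    remaining = access_count
    # One iteration models one cycle of the two shift registers; the loop
    # runs once per significant bit of access_count.
    while remaining > 0:
        result >>= 1
        remaining >>= 1
    return result


if __name__ == "__main__":
    # 5 accesses round up to 8; 8 accesses round up to 16.
    print(shift_divide(1600, 5), 1600 // 5)
    print(shift_divide(1600, 8), 1600 // 8)
```

Since the divisor is never smaller than the true access count and never more than twice it, the computed interval is at most the exact quotient and at worst about half of it, which is the factor-of-two undersetting noted above.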
Limitations of Prediction Algorithm
Our rate learner's key benefit is its simplicity and its self-containment (i.e., it only listens to the LLC-ORAM controller queue and computes its result internally). That said, the predictor (Equation 1) has two limitations. First, it is oblivious to access rate variance (e.g., it may overset the rate for programs with bursty behavior). The shifter implementation in § 7.2 helps to compensate for this effect. Second, it is oblivious to how program performance is impacted by ORAM rate.
We experimented with a more sophisticated predictor that simultaneously predicts an upper bound on performance overhead for each candidate rate in R and sets the rate to the point where performance overhead increases "sharply." What constitutes "sharply" is controlled by a parameter, which gives a way to trade off performance/power (e.g., if the performance loss of a slower rate is small, we should choose the slower rate to save power).
As we have mentioned, however, an important result in § 9.5 is that |R| can be small (recall: |R| = 4 is sufficient). This makes choosing rates a coarse-grain enough operation that the simpler predictor (§ 7.1) chooses similar rates as the more sophisticated predictor. We therefore omit the more sophisticated algorithm for space.
8 Preventing Replay Attacks
Clearly, the set of timing traces (denoted T) is a function of the program P, the user's data D, and the leakage parameters (e.g., E and R). If the server is able to run multiple programs, data or epoch parameters, it may be able to create T1, T2, etc. (i.e., one set of traces per experiment) such that log Π_i |Ti| > L, breaking security (§ 2.1).
One way to prevent these attacks is to ensure that once the user submits his/her data, it can only be 'run once.' This can be done if the secure processor "forgets" the session key K after the user terminates the session. In that case, Step 1 in the user-server protocol ( § 5) expands to the following:
1. The user generates a random symmetric key, call this K', encrypts K' with the processor's public key, and sends the resulting ciphertext of K' to the processor.
2. The processor decrypts K' using its secret key, generates a random symmetric key K (where |K| = |K'|) and sends encryptK'(K) back to the user. The processor stores K in a dedicated on-chip register. The user can now continue the protocol described in § 5 using K. When the user terminates the session, the processor resets the register containing K.
The key point here is that once the user terminates the session, K is forgotten and encryptK(D) becomes computationally un-decryptable by any party except for the user. Thus, encryptK(D) cannot be replayed using a new program/epoch schedule/etc. The downside is a restriction to the usage model: the user's computation can only proceed on a single processor per session.
Broken Replay Attack Prevention Schemes
Preventing replay attacks must be done carefully, and we now discuss a subtly broken scheme. A common mechanism to prevent replay attacks is to make the execution environment and its inputs fixed and deterministic. That is, the user can use an HMAC to bind (the hash of a fixed program P, input data D, E, R) together. If the server runs that tuple multiple times (with the corresponding program P) in a system with a fixed starting state (e.g., using [11]), the program will terminate in the same amount of time and the rate learners (§ 7) will choose the same rates each time. Thus, the observable timing trace should not change from run to run, which (in theory) defeats the replay attack.
This type of scheme is insecure because of non-deterministic timing on the main memory bus (e.g., FSB) and DRAM DIMM. Different factors, from bus contention with other honest parties to an adversary performing a denial of service attack, will cause main memory latency to vary. Depending on main memory timing, the secure processor will behave differently, causing IPC/power to vary, which causes the rate learner to (potentially) choose different rates. Thus, the tuple described above (even with a deterministic architecture) does not yield deterministic timing traces and the replay attack succeeds. This problem is exacerbated as the secure processor microarchitecture becomes more advanced. For example, depending on variations in main memory latency, an out-of-order pipeline may be able to launch none or many non-blocking requests.
Evaluation
We now evaluate our proposal's efficiency and information leakage overheads.
Methodology
9.1.1. Simulator and benchmarks. We model secure processors with a cycle-level simulator based on the public domain SESC [27] simulator that uses the MIPS ISA. We evaluate a range (from memory-bound to compute-bound) of SPEC-int benchmarks running reference inputs. Each benchmark is fast-forwarded 1-20 billion instructions to get out of initialization code and then run for an additional 200-250 billion instructions. Our goal is to show that even as epochs occur at sparser intervals (§ 6.2), our efficiency improvements still hold (§ 9.4).
9.1.2. Timing model. All experiments assume the microarchitecture and parameters given in Table 1. We also experimented with 512 KB - 4 MB LLC capacities (as this impacts ORAM pressure). Each size made our dynamic scheme impact a different set of benchmarks (e.g., omnetpp utilized more ORAM rates with a 4 MB LLC but h264ref utilized more with a 1 MB LLC). We show the 1 MB result only as it was representative. We note that despite the simple core model in Table 1, our simulator models a non-blocking write buffer which can generate multiple, concurrent outstanding LLC misses (like Req 3 in § 7.1.1).
In the table, 'DRAM cycle' corresponds to the SDR frequency needed to rate match DRAM (i.e., 2 * 667 MHz = 1.334 GHz). We model main memory latency for insecure systems (base_dram in § 9.1.6) with a flat 40 cycles. For ORAM configurations, we assume a 4 GB capacity Path ORAM (§ 3) with a 1 GB working set. Additional ORAM parameters (using notation from [26]) are 3 levels of recursion, Z = 3 for all ORAMs, and 32 Byte blocks for recursive ORAMs. As in [26], we simulate our ORAM on
we count all accesses made to each component, multiply each count with its energy coefficient, sum all products and divide by cycle count.
We account for dynamic power only except for parasitic leakage in the L1/L2 caches (which we believe will dominate other sources of parasitic leakage). To measure DRAM controller energy, we use the peak power reported in [3] to calculate energy-per-cycle (0.076 nJ). We then multiply this energy-per-cycle by the number of DRAM cycles that it takes to transfer a cache line's worth of 16 Byte chunks (our pin bandwidth; see Table 1) over the chip pins. For every 16 Bytes written, the operation is reversed: the stash is read and the chunk is re-encrypted.
See Table 2 for AES/stash energy coefficients. AES energy is taken from [21], scaled down to our clock frequency and up to a 1 AES block/DRAM cycle throughput. Stash read/write energy is approximated as the energy to read/write a 128 KB SRAM modeled with CACTI [30].
We assume the ORAM's DRAM controller constantly con-
we fix Tmax = 2^62. Thus, the baseline leakage through the early termination channel (without ORAM) is 62 bits and we will compare our scheme's additional leakage to this number. Of course, the SPEC programs run for a significantly smaller time and leak fewer bits as a result.
9.1.6. Baseline architectures. We compare our proposal to five baselines:
1. base_dram: All performance overheads are relative to a baseline insecure (i.e., no security) DRAM-based system. We note that a typical SPEC benchmark running base_dram with our timing/power model (§ 9.1.2-9.1.3) has an IPC between 0.15-0.36 and a power consumption between 0.055-0.086 Watts.
2. base_oram: A Path ORAM-based system without timing channel protection (e.g., [26]). This can be viewed as a power/performance oracle relative to our proposal and is insecure over the timing channel.
3. static_300: A Path ORAM-based system that uses a single static rate for all benchmarks. This follows [7] and can be viewed as a secure (zero leakage over the ORAM timing channel) but strawman design. We swept a range of rates and found that the 300 cycle rate minimized average performance overhead relative to base_dram. This point demonstrates the performance limit for static schemes and the power overhead needed to attain that limit.
4. static_500 and static_1300: To give more insight, we also compare against static rate schemes that use 500 and 1300 cycle ORAM rates. static_500/static_1300 has roughly the same performance/power (respectively) as the dynamic configuration we evaluate in § 9.3. Thus, static_500 conveys the power overhead needed for a static scheme to match our scheme in terms of performance (and vice versa for static_1300).
All static schemes assume no protection on the early termination channel and therefore leak ≤ 62 bits (§ 9.1.5).
Figure 5: The relationship between power and performance overhead for a memory bound (mcf) and compute bound (h264ref) benchmark. Axis: ORAM rate (larger is slower).
Choosing the Spread of Values in R
To select extreme values in R, we examine a memory (mcf) and compute bound (h264ref) workload (Figure 5). In the figure, we sweep static rates and report 
Comparison to Baselines
Our main result in Figure 6 shows the performance/power benefits that dynamically adjusting ORAM access rate can achieve. base_oram has the lowest overhead: 3.35× perfor-
static_300 incurs 3.S0x/S.6Sx performance/power overhead. This is 6% better performance and 47% higher power consumption relative to dynamic_R4_E4. Also compared to the dynamic scheme, static_500 incurs a 34% power overhead (breaking even in performance) while static_1300 incurs a 30% performance overhead (breaking even in power). Thus, through increasing leakage by ≤ 32 bits (giving a total leakage of 62 + 32 = 94 bits; § 9.1.5) our scheme can achieve 30% / 34% performance/power improvement depending on optimization criteria.
Stability
libquantum is memory bound and our scheme consistently incurs only 8% performance overhead relative to base_oram. gobmk has erratic-looking behavior but consistently selects the same rate after epoch 6 (marked e6). After epoch 6, our dynamic scheme selects the 1290 cycle rate (see § 9.2 for rate candidates) which is why its performance is similar to that of static_1300. We found that astar and gcc behaved similarly.
Reducing the Leakage Bound
We can control leakage by changing the number of candidate access rates |R| and the epoch frequency |E|.
possible that running the workload longer will fix this problem; e.g., we saw the same behavior with gobmk but the 200 billion instruction runtime was enough to smooth out the badly performing epoch. Despite this, dynamic_R4_E16 (8 epochs in Tmax = 2^62 cycles) reduces ORAM timing leakage to 16 bits (from 32 bits in § 9.3) and only increases average performance overhead by 5% (while simultaneously decreasing power by 3%) relative to dynamic_R4_E4.
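The leakage figures quoted for the two dynamic configurations can be reproduced with a small sketch, under two assumptions of ours: a new rate may be selected at times 2^30 · k^i cycles (k = 4 or 16) that fall before Tmax, and each selection can pick any of the |R| rates, so at most |R|^|E| timing traces exist. The function names are hypothetical.

```python
import math


def rate_change_points(lg_first: int, lg_tmax: int, growth: int) -> list:
    """Cycle counts at which a new ORAM rate may be chosen before Tmax.

    Assumes the first selection happens after 2**lg_first cycles and each
    later one at `growth` times the previous cycle count.
    """
    if growth < 2:
        raise ValueError("growth must be at least 2")
    points = []
    t = 1 << lg_first
    while t < (1 << lg_tmax):
        points.append(t)
        t *= growth
    return points


def oram_timing_leakage_bits(num_rates: int, num_epochs: int) -> float:
    """lg of the trace count |R|**|E|, i.e. |E| * lg |R| bits."""
    return num_epochs * math.log2(num_rates)


if __name__ == "__main__":
    for growth in (4, 16):
        epochs = len(rate_change_points(30, 62, growth))
        print(growth, epochs, oram_timing_leakage_bits(4, epochs))
```

Under these assumptions, growing epochs faster or shrinking R both reduce the bound, at the cost of fewer opportunities for the learner to adapt.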
Discussion
Letting the user choose L (the bit leakage limit, see § 2). So far, we have assumed that L is fixed at manufacturing time for simplicity. To specify L per session, the user can send L (bound to the user's data using a conventional HMAC) to the processor during the user-server protocol (§ 5). When the server forwards leakage parameters (e.g., R, E) to the processor, the processor can decide whether to run the program by computing possible leakage as in § 6.1.
A second subtlety is that bit leakage can be probabilistic [31]. That is, the adversary may learn > L bits of the user's data with some probability < 1. Suppose a program can generate 2 timing traces. Our leakage premise from § 2.1 says we would leak ≤ lg 2 = 1 bit. The adversary may learn L' bits (where L' > L) per trace with the following encoding scheme: if L' bits of the user's data match a complete, concrete bit assignment (e.g., if L' = 3 one assignment is 001 in binary) choose trace 1; otherwise choose trace 2. If the user's data is uniformly distributed bits, the adversary learns all L' bits with probability (2^L − 1)/2^L' (here, 1/2^L').
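The two-trace encoding can be checked by brute force. The sketch below enumerates every input of a given length, applies the encoding (trace 1 if the leading bits equal a target assignment, trace 2 otherwise) and reports how often the adversary learns the whole assignment. Matching on the leading bits and the names `chosen_trace` and `prob_full_disclosure` are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def chosen_trace(data_bits, target) -> int:
    """Trace 1 if the first len(target) data bits equal `target`, else trace 2."""
    return 1 if tuple(data_bits[:len(target)]) == tuple(target) else 2


def prob_full_disclosure(n_bits: int, target) -> Fraction:
    """Probability, over uniform n_bits inputs, that trace 1 is observed.

    Observing trace 1 tells the adversary all len(target) matched bits.
    """
    if len(target) > n_bits:
        raise ValueError("target longer than the input")
    inputs = list(product((0, 1), repeat=n_bits))
    hits = sum(1 for d in inputs if chosen_trace(d, target) == 1)
    return Fraction(hits, len(inputs))


if __name__ == "__main__":
    # L' = 3 with the assignment 001, over 5-bit uniformly random inputs.
    print(prob_full_disclosure(5, (0, 0, 1)))
```

The program still produces only two traces, so the deterministic bound of one bit holds; the extra bits are learned only on the rare inputs that match the chosen assignment.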
11 Related Work
Foundational Work
This paper builds on recent work on Path ORAM and information-theoretic approaches to bounding leakage.
Path ORAM's theoretical treatment is given in [32]. Path ORAM has been studied in secure processor settings using software simulation [26] and FPGA implementation [19]. Path ORAM integrity protection mechanisms are covered in [25]. None of the above works protects the ORAM timing channel. To our knowledge, the only work to protect against the ORAM timing channel is [7], which imposes a strict, periodic rate that we evaluate against (§ 9).
5 For perspective, in Figure 6 an average of 34% of ORAM accesses made by our dynamic scheme are dummy accesses.
The most relevant information-theoretic work is Predictive Mitigation [2] and leakage bounding techniques for on-chip caches [14]. We use ideas similar to Predictive Mitigation to break programs into epochs (§ 6.1), although our setting is somewhat different since [2] does not permit dummy accesses to fool an adversary. [14] applies the same information-theoretic framework to bound leakage in on-chip caches. The key difference from our work is that [14] focuses on quantifying the leakage of different schemes. Our focus is to develop hardware mechanisms to bound leakage and to trade off that leakage for efficiency.
More generally, timing channel attacks and related protections have been a hot topic since it was discovered that RSA and other crypto-algorithms could be broken through them [13]. We cannot list all relevant articles, but two related papers are Time Warp [20] and Wang et al. [35]. Time Warp also uses epochs to fuzz architectural mechanisms (e.g., the RDTSC instruction) and, using statistical arguments, decrease timing leakage. Wang et al. propose novel cache mechanisms to defeat shared-cache timing attacks.
Secure processors
The eXecute Only Memory (XOM) architecture [18] mitigates both software and certain physical attacks by requiring applications to run in secure compartments controlled by the program. XOM must be augmented to protect against replay attacks on memory. Aegis [33], a single-chip secure processor, provides integrity verification and encryption on-chip so as to allow external memory to be untrusted. Aegis is therefore protected against replay attacks.
A commercialized security device is the TPM [34], a small chip soldered onto a motherboard and capable of performing a limited set of secure operations. One representative project that builds on the TPM is Flicker [22], which describes how to leverage both AMD and Intel TPM technology to launch a user program while trusting only a very small amount of code (as opposed to a whole VMM).
The primary difference between our setting and these works is the threat model: none of them requires main memory address or timing protection. Address leakage is a widely acknowledged problem (outside of ORAM, [41] shows how program control flow can be determined through memory access patterns). Although main memory timing leakage has not been addressed, a lesson from prior work (§ 11.1) is that when there is a timing channel, attackers will try to exploit it.
Systems that enforce non-interference
There is a large body of work that is built around systems that provably enforce non-interference between programs ([36] is a representative paper). Non-interference is the guarantee that two programs can coexist, yet any actions taken by one program will be invisible (over the timing channel in particular) to the other program. In our setting, non-interference is akin to a single, strict rate that permits no leakage [31]. We believe that our proposal, which permits some interference, may be applicable and useful to these works.
Conclusion
We propose mechanisms that provably guarantee a small upper bound on timing channel leakage and achieve reasonable performance overheads relative to a baseline ORAM (with no timing channel protection). Our schemes are significantly more efficient than prior art, which was restricted to choosing a static rate of accessing memory.
