Abstract: Information leakage is a fundamental concern in most computing systems. One security weak point in a processor-based system is the bus between the processor chip and the untrusted off-chip memory, where data can be snooped by an attacker. Encryption is commonly used to protect data confidentiality. However, encryption alone is not sufficient, since information about the system can still be revealed through the plaintext memory access trace of the processor. A possible solution is to obscure the access trace with an oblivious random-access memory (ORAM) scheme, in which true memory accesses are covered by random dummy accesses to the memory. Existing ORAM designs, however, involve a large number of dummy accesses for each true access, which adds significant performance overhead to the execution. In this paper, we propose a low-performance-overhead design, RF-ORAM, in which a true memory access is hidden among dummy accesses to a small flock of random memory locations. The design has two features. First, the accessed memory data are shuffled not only within the current flock but randomly across multiple flocks, and flocks are not correlated with each other, so the randomness of the access trace can be achieved with small flocks and the performance overhead is reduced. Second, the operations in the ORAM are allowed to overlap, which improves performance further. Our experiment on the Xilinx XC7VX330T FPGA platform shows that, for a true memory request, RF-ORAM reduces the performance overhead by more than 5 times compared to the state-of-the-art design.
Introduction
Security is an important issue in computing systems. For the processor-based computing system, the buses and the off-chip memory can be easily accessed by attackers and cannot be trusted. The attacker can gain access to the bus and snoop on the data that may hold sensitive information. Even though encryption can be used to protect the data confidentiality, the plain memory address is still required for each memory access. Hence the data encryption alone is not secure since the system information can be leaked from the memory access trace. The access trace can reveal how an execution flow is diverted, how many iterations of a loop are executed and how frequently a memory location is accessed. Those access patterns can be exploited by the adversary to gain critical/sensitive information. In fact, any access trace, if not random, will carry some information that may be useful to the attacker. Therefore, making the access trace random to hide the access pattern is necessary, and to this end the oblivious random-access scheme, ORAM, can be used.
ORAM was initially proposed by Goldreich et al. [1] as a software algorithm to hide the access trace to the storage space on remote servers in cloud computing. With ORAM, data blocks are encrypted and true storage accesses are covered by massive dummy accesses. For each true access, the data blocks in a large number of storage locations are fetched, decrypted, re-encrypted, and shuffled back to the storage space. It therefore takes a long time to complete a true access, making ORAM impractical for most applications, especially at the processor-design level. To address this performance issue, a number of improved designs have been proposed [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. Among them, Path ORAM [12] is the most advanced one and has been used in some secure processor system designs [13], [14], [15], [16]. However, the performance overhead incurred by the ORAM operation is still a big problem and can be a serious bottleneck, affecting the overall system functionality and capability.
In this paper, we propose an ORAM design where a true memory access is hidden in a flock of dummy random memory accesses. Our main contributions are:
• We presented a novel shuffling design that shuffles data not only within the current flock but randomly across multiple flocks, so that the randomness of the access trace can be easily achieved with small flocks and the performance overhead can be greatly reduced.
• We discovered the shuffle space size for the optimal trade-off between the ORAM performance overhead and the access trace randomness.
• We proposed a parallel design that overlaps operations in the ORAM so that the performance overhead can be further reduced.

The rest of the paper is organized as follows. Section 2 reviews the related work on ORAM designs. Our ORAM design is described in Section 3. The experiment results are presented in Section 4. The paper is concluded in Section 5.
Related Work
Square Root ORAM (SR ORAM) [1] is the first ORAM design, proposed by Goldreich et al. for accessing application data on a remote server. For an application of $N$ blocks on the remote storage, the design requires $\sqrt{N}$ blocks as a dummy space for data shuffling and $\sqrt{N}$ blocks as a shelter to hold accessed data. It operates in shuffle rounds. In each round, the application data are reshuffled over the original data space and the dummy space, after which $\sqrt{N}$ true blocks can be accessed. For each true access, the ORAM first performs a full shelter scan. If the required data block is found, it makes a random access to the $N + \sqrt{N}$ space; otherwise, it searches the $N + \sqrt{N}$ space for the requested block. The design has a very high cost per true memory access: the amortized cost is $O(\sqrt{N} \log N)$ and the worst-case cost is $O(N \log^{2} N)$ [17]. To reduce the performance overhead, an improved SR ORAM, Interleave Buffer Shuffle (IBS) SR-ORAM, was proposed [18]. IBS SR-ORAM moves the shelter from the server to the client side, reducing the accesses to the remote storage. Its amortized cost for one true access is $O(\sqrt{N})$ and its worst-case cost is $O(N)$.
Another design to reduce the performance overhead is Basic Hierarchical ORAM [19]. The design organizes data blocks into a hierarchy of multiple levels. From the top to the bottom, each level in the hierarchy contains an increasing number of blocks. The top level is used to cache recently accessed blocks. For each true access request, the design starts the search from the top level, based on the block hash key. Once the requested block is found, the remaining levels are randomly accessed and the updated block is finally written back to the top level. To ensure that the top level has enough space for newly accessed blocks, the design regularly shuffles data from upper levels to lower levels. The design reduces the average number of memory accesses per true memory request to $O(\log^{4} N)$ and the worst-case cost to $O(N \log^{3} N)$.

Later, Stefanov et al. [11] proposed a partition-based hierarchical ORAM that organizes the server storage in multiple partitions, with each partition forming a small hierarchical ORAM. In this design, a position map, a stash, and a shuffling buffer are used. The position map stores the location of each data block in the ORAM storage hierarchy. The stash caches recently fetched data blocks from each partition. The shuffling buffer offers the place for block reshuffling. For a true memory access, the ORAM searches for the block only in the mapped partition. The block can be in either the local stash or the remote storage partition. Once the block is found, it is randomly shuffled to the stash of a different partition. The amortized cost of one true access is $O(\log N)$ and the worst-case cost is $O(\sqrt{N})$.

Path ORAM [12] is an advanced hierarchical ORAM design. It arranges the server storage in a binary tree. Each application data block is mapped to one leaf node. A true memory access is covered by the accesses to the path from its leaf node to the root of the tree. As in the partition-based hierarchical ORAM, the position map holds the path mapping for each data block and the stash caches fetched blocks. Unlike the previous designs, Path ORAM does not require any explicit reshuffling; the reshuffling is realized by sorting the blocks in the stash. Its cost to hide one true access is $O(\log N)$ in both the best and worst cases.
Based on the Path ORAM algorithm, several implementations have been proposed [20], [14], [16], [8], [15], [21]. PHANTOM [14] is the first FPGA implementation of Path ORAM. It uses large data blocks (4 KB) with heap sorting in the stash. The bottleneck of PHANTOM is the large amount of FPGA resources required by its parallel AES units and linear position map.
To reduce the large on-chip cost incurred by the linear position map, [7] presented a recursive ORAM. It, however, introduces a long delay due to the recursive position map search. To address this issue, [20] proposed a "Freecursive" ORAM. This ORAM uses a PosMap cache (PosMap Lookaside Buffer, or PLB) to temporarily store recently fetched PosMap blocks, which reduces the off-chip memory accesses and hence improves the ORAM performance. However, the improvement due to the PLB depends on the data locality of the program.
The most efficient Path ORAM implementation presented so far is Tiny ORAM [15], where a small data block (512 bytes) is used and the cost of encryption/decryption is reduced by a special mechanism. Path ORAM has also recently been integrated into a multi-core system design [21] to co-run secure applications with non-secure applications on the system. Though the above designs make the ORAM implementation possible, their performance overheads are still significant and all increase with the application data size. In this paper, we propose an ORAM, called random-flock ORAM (RF ORAM), that has O(1) performance overhead in both the best and worst cases, i.e. the performance overhead is independent of the application size. This makes it highly scalable and feasible for real ORAM applications in processor-level system design, as discussed in the next section.
RF ORAM
With our design, each true memory access is hidden inside a flock of randomly accessed memory blocks. We define the address used by the processor as the "Logical Address" (LA) and the address used to access the physical memory as the "Physical Address" (PA). We also define a memory access requested by the processor as a "true memory access" and the other accesses in a flock as "dummy accesses". A block that holds application data is called a "true data block" and a block that contains dummy data a "dummy data block". A dummy access can access either a true data block or a dummy data block.
For a given application, assume its space size is N blocks. Our ORAM works on this "application space" plus a "shuffle space" of B dummy locations. Similar to other existing designs, our ORAM carries out a set of tasks for a true memory access: generating a group of random memory locations (we call it a "flock", due to the nature of the data movement), fetching data from the flock, decrypting the fetched data (the decrypted data will be used by the processor if the true access is a read, or the data will be updated if the true access is a write), re-encrypting, and shuffling data back to the flock. We collectively call this set of basic tasks an "ORAM operation". An ORAM operation converts a true memory access into a sequence of random read and write accesses to the memory.
The key part of the ORAM is shuffling data in the memory. If the data items are only shuffled within their own flock space, the randomness achieved can be very low, especially when the flock size is small; furthermore, successive accesses to the same data can be easily traced, since the next access location must be one of the locations from the previous flock. Here we introduce a cache buffer and a shuffle buffer, with which a data block read in one flock will be written to the memory in the future with another flock, so that such a correlation is broken, as described below.
Design Description
In this section, we present our design in a bottom-up manner by introducing first the key components of RF ORAM and then its detailed operations.
The overview of our ORAM is given in Figure 1. It consists of a position map (PosMap), encryption (Enc pipe) and decryption (Dec pipe) components, and random number generators (RNG), all of which are commonly used in most ORAMs, plus three special buffers: a cache buffer (CBuffer), a shuffle buffer (SBuffer) and a flock address buffer (FABuffer). There are also a read buffer (RBuffer) and a write buffer (WBuffer) to hold data for the memory read and write operations. The basic structure of the key components and their functions in the design are explained below.
The flock address buffer holds the memory addresses for the current flock. The buffer size is the same as the flock size. Each address in the buffer is used for both the flock read and the flock write.

The cache buffer caches the true data blocks read from memory. The buffer size should therefore be larger than the flock size. The buffer is structured like a fully associative cache, with each entry containing a data block and its logical address. However, the cache is accessed through an entry index rather than a full cache scan; hence the cost of the buffer is much lower than that of a traditional fully associative cache. The cache uses the random replacement policy. When a new block is cached, the existing block is evicted to the shuffle buffer.
The shuffle buffer holds the blocks to be written to the locations of the current flock. Its size is equal to the flock size. Apart from the blocks evicted from the cache buffer, dummy blocks are inserted if the buffer is not full. Blocks in the buffer are shuffled before being written back to the memory.

The position map table (PosMap) has N + B entries and holds the mapping for true and dummy blocks, as illustrated in Figure 2(a). The first field of the table indicates whether a data block is stored in an on-chip buffer or in the off-chip memory. If it is in the memory, the second field provides the physical address of the block. If it is buffered on chip, the leftmost bit of the second field specifies the buffer in which the data block is located (0 for CBuffer and 1 for SBuffer) and the remaining bits give the location of the block in that buffer. The mappings provided by the table are supposed to be unpredictable. Therefore, the position map should be initialized with random mappings (also known as warmup) before it is used for a given application, and the table is dynamically updated during ORAM operations.
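As an illustration of the two-field PosMap entry layout described above, the following Python sketch packs and unpacks one entry. It is only a sketch under stated assumptions: the 18-bit width of the second field follows the block address size used in the evaluation, while the function names, the integer packing, and the convention that a first-field value of 1 means "on chip" are hypothetical choices made for this example.

```python
ADDR_BITS = 18  # assumed width of the second field (18-bit block address)

def pack_entry(on_chip, value, in_sbuffer=False):
    """Pack one PosMap entry into an integer.

    Field 1 (top bit): 1 if the block is in an on-chip buffer (assumed polarity).
    Field 2: the physical address if the block is off chip; otherwise the
    leftmost bit selects the buffer (0 = CBuffer, 1 = SBuffer) and the
    remaining bits are the index inside that buffer.
    """
    if on_chip:
        if not 0 <= value < 1 << (ADDR_BITS - 1):
            raise ValueError("buffer index out of range")
        field2 = (int(in_sbuffer) << (ADDR_BITS - 1)) | value
    else:
        if not 0 <= value < 1 << ADDR_BITS:
            raise ValueError("physical address out of range")
        field2 = value
    return (int(on_chip) << ADDR_BITS) | field2

def unpack_entry(entry):
    """Return ('mem', pa), ('cbuffer', index) or ('sbuffer', index)."""
    on_chip = entry >> ADDR_BITS
    field2 = entry & ((1 << ADDR_BITS) - 1)
    if not on_chip:
        return ("mem", field2)
    index = field2 & ((1 << (ADDR_BITS - 1)) - 1)
    which = "sbuffer" if field2 >> (ADDR_BITS - 1) else "cbuffer"
    return (which, index)
```

For example, `unpack_entry(pack_entry(True, 7, in_sbuffer=True))` identifies entry 7 of the shuffle buffer, while an off-chip entry simply carries the physical address.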
To protect data confidentiality, each data block is encrypted before it is sent off chip. When the data block is fetched from the memory, it is decrypted. The data is re-encrypted every time before it is written back to the memory. To prevent the same data block from being identified, each encryption must produce a different ciphertext, as in most existing data encryption designs. For this reason, we use a counter to add variance to a block each time it is re-encrypted, as shown in Figure 2(b) and (c), where a data block includes a counter (cnt). The counter is incremented by 1 for each encryption (cnt++) and is reset to a random value when it reaches its maximum. Since encryption and decryption are performed on a sequence of data blocks during memory reads and writes, they are pipelined to increase the throughput.
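The counter mechanism can be sketched in Python as follows. This is a toy model, not the hardware design: the standard library has no AES, so a 4-round Feistel network built on SHA-256 stands in for the block cipher (an assumption made purely for illustration). Packing the counter after the data, the 16-bit counter width, and all function names are likewise assumptions of this sketch.

```python
import hashlib
import secrets

CNT_BYTES = 2  # 16-bit counter, the width assumed for this sketch

def next_counter(cnt):
    """Increment the counter; once it is at its maximum, reset it to a random value."""
    if cnt == (1 << (8 * CNT_BYTES)) - 1:
        return secrets.randbelow(1 << (8 * CNT_BYTES))
    return cnt + 1

def _f(key, i, half):
    # Round function of the stand-in cipher (NOT AES).
    return hashlib.sha256(key + bytes([i]) + half).digest()[:len(half)]

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_block(key, data, cnt):
    """Encrypt data together with the incremented counter; len(data) must be even."""
    block = data + next_counter(cnt).to_bytes(CNT_BYTES, "big")
    h = len(block) // 2
    left, right = block[:h], block[h:]
    for i in range(4):
        left, right = right, _xor(left, _f(key, i, right))
    return left + right

def decrypt_block(key, cipher):
    """Return (data, counter) recovered from an encrypted block."""
    h = len(cipher) // 2
    left, right = cipher[:h], cipher[h:]
    for i in reversed(range(4)):
        left, right = _xor(right, _f(key, i, left)), left
    block = left + right
    return block[:-CNT_BYTES], int.from_bytes(block[-CNT_BYTES:], "big")
```

Because the counter is part of the encrypted block, re-encrypting unchanged data with the next counter value yields a different ciphertext, which is the property the design relies on.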
There are six types of random values, r1-r6, that need to be dynamically generated, as indicated in Figure 1 . They are:
• r1: to select a location from the position map for a flock; P − 1 r1-values are required to form a flock;
• r2: to select a new location for a block during the block permutation in the shuffle buffer; P r2-values are required;
• r3: to select a position in the cache buffer for block replacement; up to P r3-values are required for each ORAM operation;
• r4: to generate dummy data if the shuffle buffer is not full; up to P − 1 r4-values are required to fill the buffer;
• r5: to select a position in the flock for a true block;
• r6: to provide the random value to initialize/reset the counter for a block.
In principle, six generators are required for the six types of random numbers. However, since the random value for the data block counter (r6) is not frequently required and the true block position (r5) is needed only once for each ORAM operation, we can reuse the D-bit RNG for those values. In the end, we therefore have four RNGs (RNG1-4) for the six random numbers, as shown in Figure 2(d), where N is the application storage space size, B the shuffle space size, P the flock size, C the cache buffer size, D the data block size, and E the block counter size. RNG4 provides three random numbers (r4, r5, and r6), which are used at different times. Therefore, the random numbers are not correlated with each other.
In our design, a true data block is in one and only one of the three storage components: CBuffer, SBuffer, and the off-chip memory. If the block is in an on-chip buffer, no further ORAM operation is required; otherwise, a round of ORAM memory accesses will be performed. The control of the ORAM operation is specified in Algorithm 1, which is explained below.
On a true access requested by the processor, with the logical address LA, the type of access R/W (read or write), and the new data if it is a write operation, the ORAM controller first looks up the position map (PosMap) for the mapped location of LA (Line 1). If the block is on chip (either in CBuffer or in SBuffer), the access is performed on the on-chip buffer (Lines 2-3): the requested data is sent to the processor if R/W is a read; otherwise, the block in the buffer is updated with the new data.

For the ORAM operation, a flock of P addresses is formed and stored in the flock address buffer (Line 5). The flock consists of the true access location (PA) and the P-1 mapped addresses of randomly selected blocks from the position map. The true block is hidden in the flock at a random position (selected by r5). The blocks in the flock are then fetched from the memory to the read buffer (Line 6) and decrypted (Line 7). Among all blocks in the read buffer, the dummy blocks are discarded and the true blocks are cached; for each cached block, an existing block selected by r3 is evicted to the shuffle buffer (Line 8). At the same time, if the requested true block is for a read, the related data is returned to the processor; otherwise, the block is updated in the cache buffer (Lines 9-13). Since the fetched blocks are unlikely to all be true blocks, the shuffle buffer may have some empty spaces, which are filled with dummy blocks (Line 14). The blocks in the SBuffer are then shuffled against the flock address buffer: for a given address in the flock, a block in the shuffle buffer is selected by an r2 value (Line 15), which hides the information about the real blocks and dummy blocks during the flock write. Finally, each block in the shuffle buffer is encrypted before being written back to the memory (Lines 16-17).
It must be stated that after a block is cached in CBuffer, evicted to SBuffer, or shuffled to a different location in the memory, its entry in the PosMap is updated accordingly, which is not shown in the algorithm.
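The control flow described above, including the PosMap updates, can be modelled in a few lines of Python. The sketch below is a simplified behavioural model, not a transcription of Algorithm 1. Its assumptions: encryption is omitted, the flock read and flock write run sequentially (no overlap), the evicted blocks are written back within the same operation so the shuffle buffer is empty between operations, flock addresses are drawn directly from the physical address range rather than through the PosMap, and Python's `random` module stands in for the hardware RNGs. The class and method names are hypothetical.

```python
import random

class RFORAMSketch:
    def __init__(self, n_blocks, n_dummy, flock, cache, seed=0):
        self.rng = random.Random(seed)            # stand-in for RNG1-4
        self.P = flock
        las = list(range(n_blocks))
        self.rng.shuffle(las)
        # Warm-up: CBuffer starts full; remaining true blocks are scattered in memory.
        self.cbuf = [(la, 0) for la in las[:cache]]           # entries are (LA, data)
        slots = [(la, 0) for la in las[cache:]]
        slots += [None] * (n_blocks + n_dummy - len(slots))   # None models a dummy block
        self.rng.shuffle(slots)
        self.mem = slots
        self.posmap = {la: ("cbuf", i) for i, (la, _) in enumerate(self.cbuf)}
        for pa, blk in enumerate(self.mem):
            if blk is not None:
                self.posmap[blk[0]] = ("mem", pa)
        self.trace = []                           # physical addresses seen on the bus

    def access(self, la, write=False, data=None):
        where, idx = self.posmap[la]
        if where == "cbuf":                       # on-chip hit: no memory traffic
            if write:
                self.cbuf[idx] = (la, data)
            return self.cbuf[idx][1]
        # Flock: the true PA plus P-1 other distinct random PAs.
        flock = self.rng.sample([a for a in range(len(self.mem)) if a != idx],
                                self.P - 1)
        flock.insert(self.rng.randrange(self.P), idx)   # r5: random true position
        fetched = [self.mem[pa] for pa in flock]         # flock read
        self.trace.extend(flock)
        sbuf, result = [], None
        for blk in fetched:
            if blk is None:
                continue                          # dummy block: discarded
            if blk[0] == la:
                if write:
                    blk = (la, data)
                result = blk[1]
            victim = self.rng.randrange(len(self.cbuf))  # r3: random replacement
            sbuf.append(self.cbuf[victim])        # evict to the shuffle buffer
            self.cbuf[victim] = blk
            self.posmap[blk[0]] = ("cbuf", victim)
        sbuf += [None] * (self.P - len(sbuf))     # r4: fill with fresh dummies
        self.rng.shuffle(sbuf)                    # r2: permute against the flock
        for pa, blk in zip(flock, sbuf):          # flock write to the same addresses
            self.mem[pa] = blk
            if blk is not None:
                self.posmap[blk[0]] = ("mem", pa)
        self.trace.extend(flock)
        return result
```

In this model every off-chip request produces exactly P reads followed by P writes to the same P addresses, regardless of which block was requested or how many true blocks the flock happened to contain.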
We would also like to emphasize that the data blocks in the flock write are different from those in the flock read, and that the flock read and the flock write are performed in an overlapped fashion, as illustrated in Figure 3. The read and write operations of an ORAM operation are performed on a flock of the same memory addresses Fi, but the blocks in the read buffer (RBuffer) are different from those in the write buffer (WBuffer); the blocks in WBuffer come from previous ORAM operations. This decoupled flock read and write design not only adds randomness to the block shuffle but also allows the performance overhead to be further reduced.

As an example, Figure 4 shows how the memory contents are changed from one state (Figure 4(a)) to another (Figure 4(b)) by an ORAM operation (illustrated in Figure 4(c)), where the flock size is assumed to be 4 blocks and data block t1 is requested by the processor. For this request, RF ORAM reads the encrypted block t1' together with three other blocks fetched by dummy accesses, giving the flock {d0', d1', t1', t0'}; d0' and d1' are dummy data and t0' is a true block. After decryption, dummy data d0 and d1 are discarded (note that dummy data must go through decryption to hide the number of true blocks fetched) and the true data blocks t1 and t0 are cached in CBuffer, which causes blocks t7 and t4 to be evicted to the shuffle buffer. The empty space of the shuffle buffer is then filled with new dummy data blocks d2 and d3. The four blocks {t7, t4, d2, d3} are next shuffled to {d3, t7, d2, t4} and then encrypted. The encrypted blocks {d3', t7', d2', t4'} are finally written to the same memory locations where d0', d1', t1', t0' were initially located.

As can be seen, the effectiveness of the shuffle operation plays an important role in randomizing the address trace, and it is closely related to the shuffle space used. Ideally, the larger the shuffle space, the more unpredictable a block location is. However, a large shuffle space leads to high storage consumption, both off chip and on chip (due to the larger position map). Here we try to reduce the shuffle space while achieving as high a randomness as possible for the address trace, which is discussed in the next subsection.
Minimal Shuffle Space
Assume RF ORAM has an application space of $N$ blocks, a dummy space of $B$ blocks and a flock size of $P$ blocks. The total number of different flocks that can be used in the ORAM is

$$S = \binom{N+B}{P} \qquad (1)$$

When $P \neq 0$ and $P \neq N+B$, $S$ is larger than $N$. Therefore, RF ORAM has a large space for a block to shuffle to and offers a higher randomization capability than other existing ORAM designs, such as Path ORAM, where a block can only shuffle to $N$ paths.
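To give a feel for the size of this space, the short Python sketch below counts the possible flocks, assuming the count is the number of ways to choose P distinct locations out of the N + B available (a binomial coefficient). The function name and the sample sizes are chosen for illustration only.

```python
import math

def flock_count(n_blocks, n_dummy, flock):
    """Number of distinct flocks of `flock` locations out of n_blocks + n_dummy."""
    return math.comb(n_blocks + n_dummy, flock)

# Illustrative sizes: 2**18 locations in total, 512 of them dummy, flocks of 16.
total = 2 ** 18
s = flock_count(total - 512, 512, 16)
print(f"S has about {s.bit_length()} bits; N + B has {total.bit_length()} bits")
```

Even for a small flock, the number of candidate flocks exceeds the number of memory locations by many orders of magnitude.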
For a given $B$, the probability that a flock contains one dummy data block is $p_1 = B/(N+B)$, that it contains two dummy data blocks is $p_2 = \left(B/(N+B)\right)^2$, and that it contains $k$ dummy data blocks is $p_k = \left(B/(N+B)\right)^k$, $k < P$. As described in the previous subsection, we want to use the cache buffer and the shuffle buffer to randomly shuffle a block back to the memory, and for the randomness we decouple the flocks for a block read and write, which can be partially fulfilled by a random number of dummy blocks. To realize the randomness of the number of dummy blocks in a flock, we want the probabilities of the different numbers of dummy data blocks in a flock to be as close as possible. Namely, we want the probability ratio $r = p_k / p_{k+1} = (N+B)/B$ to be as small as possible. The ratio decreases as the dummy space ($B$) increases, but there is an elbow point, as illustrated by the r-plot in Figure 5. The elbow point offers a good trade-off for the optimal (minimal) dummy space.
Given the r-plot, to find the elbow point, we first create a straight line that connects the two end points of the curve. The elbow point is then at the location that forms the longest perpendicular distance from the line to the curve [22], as demonstrated in Figure 5. We found that at the elbow point

$$B = \sqrt{N} \qquad (2)$$

Namely, the optimal dummy shuffle space size is $\sqrt{N}$, where $N$ is the application storage size. (Note: the mathematical derivation of the elbow point is omitted here due to the paper length limit.)
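The elbow construction can be checked numerically with a short Python sketch. It rests on assumptions that the text does not spell out: the ratio is taken as r(B) = (N + B)/B, the curve is evaluated for integer B from 1 to N, and the axes are left unscaled. Under these assumptions the chord joining the end points has slope -1, so the farthest point from it is where r'(B) = -N/B² equals -1, i.e. B = √N. The function name is hypothetical.

```python
import math

def elbow_point(n_blocks):
    """Return the B in [1, N] with the largest perpendicular distance to the chord."""
    def r(b):
        return (n_blocks + b) / b            # assumed form of the probability ratio

    x0, y0 = 1, r(1)                         # first end point of the r-plot
    x1, y1 = n_blocks, r(n_blocks)           # last end point of the r-plot
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)

    def dist(b):
        # Perpendicular distance from (b, r(b)) to the line through the end points.
        return abs(dy * (b - x0) - dx * (r(b) - y0)) / norm

    return max(range(1, n_blocks + 1), key=dist)
```

For N = 10000 the search returns 100, and for N = 4096 it returns 64, which agrees with the square-root rule.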
Evaluation
To verify the effectiveness of our design, we built an evaluation platform, as shown in Figure 6. For a given application, its execution on a processor is simulated with the processor simulation tool SimpleScalar [23], from which the memory access trace of the processor, trace(LA), is obtained. RF ORAM is implemented in Verilog HDL, and Xilinx ISE is used for simulation and synthesis of the design on a Virtex-7 (XC7VX330T) FPGA. Through the Xilinx simulation, we obtain the costs of RF ORAM and the output address trace, trace(PA). The address trace on the memory bus is then examined with the NIST test suite [24] for the randomness of the physical addresses sent to the memory.
In our experiment, we set the memory block size ($D$) to 128 bits, the memory block address size to 18 bits ($N + B = 2^{18}$) and the counter (cnt) size of a block to 16 bits. We also use a cache buffer ($C$) of 128 blocks. We choose the open-source AES developed by [25] for encryption and decryption. The AES is a 10-stage pipeline that can take an input and produce an output at every clock cycle. We adopt the same model for the off-chip memory as used in [15], and we use a linear feedback shift register (LFSR) for pseudo-random number generation (PRNG). To make the generated random sequence unpredictable, we regularly reset each RNG with a new seed generated from decrypted dummy data recently fetched from the memory and the current cipher output. As the RNGs have different sizes and are fed new seeds independently at different times, any correlation among the RNGs is broken.
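A minimal Python sketch of an LFSR with reseeding is given below. The 16-bit width and the tap positions (16, 14, 13, 11, a well-known maximal-length configuration) are assumptions for illustration; the widths and polynomials of the RNGs in the actual design are not stated here. The `reseed` method models the periodic loading of a new seed, and the class and method names are hypothetical.

```python
class LFSR16:
    """16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11 (assumed)."""

    def __init__(self, seed):
        self.reseed(seed)

    def reseed(self, seed):
        # An all-zero state would lock the register, so map it to 1.
        self.state = (seed & 0xFFFF) or 1

    def next_bit(self):
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1   # XOR of the tapped bits
        self.state = (s >> 1) | (bit << 15)              # shift right, feed back
        return bit

    def next_value(self, bits):
        """Collect `bits` successive output bits into one integer."""
        value = 0
        for _ in range(bits):
            value = (value << 1) | self.next_bit()
        return value
```

Without reseeding, this register cycles through all 65535 non-zero states before repeating, which is why an LFSR alone is predictable and why the design injects fresh seeds from values an observer cannot see.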
In addition, since the flock address generation does not depend on the requested address, we allow our design to generate flock addresses in advance if there is any idle time between the processor memory requests. The best case is when a flock is ready in the flock address buffer for a new request from the processor, and the worst case is when the ORAM is not idle before a request and needs to generate a flock after the processor request is received. Therefore, the completion time of an ORAM operation varies between the best case and the worst case. As in other existing designs, the mapping in the position map should be randomized before an application starts (i.e. PosMap warmup). In our experiments, we run Algorithm 1 to store all the application data and dummy blocks in the off-chip memory, which initializes the position map.
Our experiments are performed on a subset of applications from SPEC-2006 and MediaBench. We run the simulation with different flock sizes on each application and capture the address trace for analysis. Table 1 shows the randomness test pass rates of the address traces when tested with the NIST suite. As can be seen from the table, when the flock size is reasonably large (larger than 8 blocks), the pass rates are uniformly high for all tests. When P=16, the average pass rate reaches 99.19%, as given in the last row of the table. We customize our design for this set of applications based on the overall randomness of the address trace, and therefore choose a flock size of 16.
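As an indication of what one such test measures, the following Python sketch implements the simplest test of the NIST suite, the frequency (monobit) test, and applies it to a bit stream derived from an address trace. The way addresses are serialized into bits (most significant bit first, fixed width) is an assumption of this sketch, as are the function names; the 0.01 significance level is the default used by the NIST suite.

```python
import math

def address_bits(trace, width):
    """Flatten a physical address trace into a bit list, MSB first (assumed layout)."""
    return [(pa >> i) & 1 for pa in trace for i in reversed(range(width))]

def monobit_p_value(bits):
    """NIST frequency (monobit) test: p-value for the balance of ones and zeros."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)     # +1 for each one, -1 for each zero
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

def passes_monobit(bits, alpha=0.01):
    """A sequence passes when its p-value is at least the significance level."""
    return monobit_p_value(bits) >= alpha
```

A trace that keeps hitting addresses with the same bit pattern fails this test immediately, whereas the full suite applies many further tests (runs, block frequency, and so on) to catch subtler structure.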
We compare our design with the state-of-the-art design, Tiny ORAM [15]. The overheads in terms of the latency per true memory access and the FPGA area cost (the number of lookup tables (LUT), flip-flops (FF), and block RAMs (BRAM)) are given in Table 2. As can be seen from the table, for each true memory access request, RF ORAM takes 0.20 to 0.25 μs. Compared to Tiny ORAM, the ORAM latency is reduced by a factor of 5.6 to 7. In terms of area cost, our design consumes fewer FFs but more LUTs and BRAMs. The higher LUT and BRAM usage is due to our pipelined AES and the non-recursive position map.
Conclusion
In this paper, we presented an ORAM design, RF ORAM, that turns a sequence of application-specific memory requests into a random-access trace to the off-chip memory so that no information can be leaked on the memory address bus.
To reduce the large latency normally introduced by traditional ORAM designs, we hide a true memory access among the dummy accesses to a small flock of random memory locations.
To improve the efficiency of address trace randomization through the small flock of dummy accesses, we proposed a shuffle approach that strategically uses a cache buffer and shuffle buffer to decouple blocks between the flock read and flock write so that flocks are not correlated and blocks read in one flock are shuffled back to the memory randomly in multiple flocks. The decoupling of flock read and flock write also enables the overlap of the read and write stages in an ORAM operation, hence further reducing the ORAM latency.
Like other existing designs, our ORAM design requires extra storage space (shuffle space). A large shuffle space normally enhances the effectiveness of the ORAM design, but the enhancement diminishes when the shuffle space exceeds a certain size. We developed an analytical model and, through mathematical derivation, found that the optimal shuffle space size is the square root of the application storage size; we adopted this shuffle space size in our design.
Our experiments show that RF ORAM reduces the ORAM latency by a factor of 5.6 to 7 compared to the state-of-the-art Tiny ORAM, and that the address trace generated by our design has an average pass rate of 99.19% in the NIST randomness tests.
