Secure processor architectures provide secure computing environments: applications that need high security should be immune to both physical and software attacks when they run on such architectures. Memory integrity verification is a key problem in implementing secure processors. This paper proposes a scheme called IV-BF to ensure data integrity. The scheme is a one-level hash structure, and it has three advantages over existing mainstream hash-tree-based schemes: lower computation overhead, lower space overhead, and an adjustable level of security. We evaluate the overhead and the security of IV-BF and compare it with an efficient integrity checking scheme, BMT. The evaluation shows that, for most benchmarks, the overhead of IV-BF is lower, and IV-BF achieves a 1%-7% performance improvement over BMT.
I. INTRODUCTION
Recently, incidents such as the cracking of commercial software and the tampering with or disclosure of confidential information have become widespread. On critical computing platforms, attackers who have sufficient access to the system can break security measures in many ways, including the use of customized hardware. Specifically, to obtain protected information, attackers may dump all data transactions on the system buses, construct customized spoofing devices or hardware, and replay the data captured from the bus.
Software-based protection, or even lightweight hardware-based protection, cannot resist this kind of physical attack.
Recognizing these threats, researchers have proposed many secure processor architectures [1] [2] [3] [4] [5] [6] for providing digital copyright protection, software confidentiality, or security guarantees for running trusted applications on remote computing devices. The microprocessor industry has also offered secure processors for commercial use, including the IBM SecureBlue [9] and the Dallas Semiconductor DS5002FP [10]. These secure architectures rely on a tamper-resistant processor and certain cryptographic hardware (the trusted computing base, TCB) to support a tamper-resistant and tamper-evident (TE) computing environment. Verifying the integrity of untrusted storage relies on algorithms that use the untampered information stored in the TCB.
Although many security architectures have been proposed, these schemes either provide inadequate security protection or incur high performance overheads. Authentication in XOM (eXecute Only Memory) [1] cannot detect replay attacks. The log hash scheme of Suh et al. [2] employs lazy authentication, in which a program is authenticated only a few times during its execution. The Merkle tree scheme used by Gassend et al. [3] causes severe performance degradation at runtime. The CHTree scheme [4] improves runtime performance at the cost of cache space. The Authenticated Speculative Execution proposed by Shi et al. [5] employs timely authentication, but requires extensive modifications to the memory controller and on-chip caches. The TEC-Tree [6] reduces computational overhead, but still needs to maintain a hash tree.
In this paper, we propose a scheme to efficiently verify all or part of untrusted external memory using a limited amount of trusted on-chip storage. The scheme is a one-level hash structure, and checking memory operations takes linear time. In addition, the security strength of our scheme can be adjusted according to the requirements.
Our security model is presented in Section 2. Related work is discussed in Section 3. The architecture of our scheme is described in Section 4. The overhead and performance analysis is discussed in Section 5. We propose an improvement of our scheme in Section 6. We evaluate the scheme on a simulator in Section 7 and conclude the paper in Section 8.
SECURE COMPUTING MODEL
In this paper, we consider a system built around a single processor with external memory and peripherals. We do not discuss multi-processor platforms here. Figure 1 illustrates the secure computing model under discussion.
A Method for Memory Integrity Authentication Based on Bloom Filter
The model comprises a tamper-resistant processor (the TCB, trusted computing base), the external memory, and peripherals. The TCB consists of the processor core, an on-chip cache, and the encryption and integrity verification mechanisms. The processor core is assumed to be trusted, which means that it is invulnerable to physical attacks and its internal state cannot be tampered with or observed. The processor can contain secret data that allows it to produce keys for cryptographic operations such as signing. The secret data can be the private key of a public key pair, as described in XOM [1]. The processor is used in a multitasking environment that uses virtual memory and runs mutually mistrusting processes. Once a program executes special instructions to enter the secure execution environment, the TCB is responsible for protecting the program. The TCB must detect whether memory operations are executed correctly.
The off-chip memory, the system bus, and peripheral devices are untrusted. Their states can be observed and tampered with by an adversary. The goal of the adversary is to tamper with the contents of external memory in such a way that the system produces an incorrect result that nevertheless looks correct to the system user. For simplicity, untrusted memory in this paper means RAM only, although the scheme presented here can be applied to other data storage devices, such as hard disks, with only minor changes.
The adversary can attack off-chip memory, so the processor needs to check whether the content read from memory is correct. If the data read from an address in main memory is the same as the value most recently stored there, we deem the memory safe so far. If the contents of the off-chip memory have been tampered with by an adversary, the memory may not behave correctly. If tampering has occurred, the proposed scheme allows the processor to detect it with high probability. If tampering is detected, the processor raises an integrity exception.
RELATED WORK
Data integrity and confidentiality is currently a very active research area. Many works have presented new schemes for uniprocessor [7] [8] [9] [10] [11] [12] [13] [14] and multi-processor platforms [15] [16] [17] [18]. In this paper we focus on integrity protection schemes for uniprocessors.
The basic method of memory integrity protection is the Message Authentication Code (MAC) [2]. A MAC is a hash value computed over a memory chunk and used later for authentication. The advantage of a MAC is that it is quick to compute and compare and can detect most attacks. However, a MAC is vulnerable to replay attacks: it guarantees that a chunk was stored by the processor, but not that it is the most recent copy. To avoid this shortcoming, all MACs must be stored in a secure area, which requires a large amount of trusted storage. For this reason, the simple MAC method can only be applied to static instructions, not to dynamically changing data. XOM uses MACs to protect the integrity of off-chip data.
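The replay weakness described above can be illustrated with a minimal sketch. It assumes HMAC-SHA256 as the MAC, a hypothetical on-chip key `KEY`, and a Python dictionary standing in for untrusted memory; none of these choices come from the schemes cited here.

```python
import hashlib
import hmac

KEY = b"hypothetical on-chip secret"  # assumption: a key that never leaves the processor

def mac(address: int, chunk: bytes) -> bytes:
    """MAC over (address, chunk); binding the address prevents moving chunks around."""
    return hmac.new(KEY, address.to_bytes(8, "little") + chunk, hashlib.sha256).digest()

memory = {}  # untrusted off-chip memory: address -> (chunk, tag)

def store(address: int, chunk: bytes) -> None:
    memory[address] = (chunk, mac(address, chunk))

def load_is_valid(address: int) -> bool:
    chunk, tag = memory[address]
    return hmac.compare_digest(mac(address, chunk), tag)

store(0x1000, b"balance=100")
stale_copy = memory[0x1000]          # the adversary records an old (chunk, tag) pair
store(0x1000, b"balance=0")

# Spoofing: the adversary changes the chunk but cannot forge the tag.
memory[0x1000] = (b"balance=999", memory[0x1000][1])
spoof_detected = not load_is_valid(0x1000)

# Replay: the adversary restores the old pair, whose tag is still valid.
memory[0x1000] = stale_copy
replay_detected = not load_is_valid(0x1000)
```

In this sketch the spoofed chunk fails verification while the replayed pair passes, because the tag stored next to the data says nothing about freshness.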
Lhash (Log Hash Integrity Checking) and H-Lhash (Hierarchical Scheme of Log Hash Integrity Checking) [2] are "lazy" memory integrity verification schemes proposed by MIT. The processor maintains a read log and a write log of all of its operations on off-chip memory. At runtime, the processor updates the logs with minimal overhead so that it can verify the integrity of a sequence of operations at a later time. H-Lhash extends Lhash by caching part of the nodes, which permits more frequent verification. Because these methods only record logs most of the time, they verify memory integrity with low runtime overhead, but they cannot detect attacks immediately.
Hash tree (or Merkle tree) [3] is often used to verify the integrity of dynamic data in untrusted storage. The memory space is divided into multiple chunks. The chunks are the leaves of a hash tree. The parent is the hash value of the concatenation of its children. The root of the tree is stored on-chip where it cannot be tampered. Hash tree integrity verification is often preferable over other schemes because of its security strength. Besides spoofing and splicing attacks, replay attacks can also be prevented.
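A minimal sketch of this construction follows. It assumes SHA-256 as the hash, a power-of-two number of chunks, and a common variant in which each leaf is the hash of its chunk; the function names are illustrative and not taken from [3].

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(chunks):
    """Return the tree as a list of levels, leaf hashes first, root level last.
    Assumes len(chunks) is a power of two."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def verify_chunk(index, chunk, untrusted_levels, trusted_root):
    """Recompute the path from a leaf to the root using sibling hashes read
    from untrusted memory, then compare with the root kept on-chip."""
    node = h(chunk)
    for level in untrusted_levels[:-1]:
        sibling = level[index ^ 1]
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == trusted_root

old_levels = build_tree([b"a", b"b", b"c", b"d"])
old_root = old_levels[-1][0]

# The processor later writes chunk 2 and updates the on-chip root.
new_levels = build_tree([b"a", b"b", b"z", b"d"])
new_root = new_levels[-1][0]
```

Replaying the old chunk together with the old internal nodes reproduces only the old root, which no longer matches the root stored on-chip; this is why the tree also resists replay.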
There are several improved methods based on the hash tree. MIT proposed CHTree (Cached Hash Tree) [4], which caches the internal hash nodes on-chip. The processor trusts data stored in the cache and can perform memory accesses directly on them without any hashing. Therefore, the processor checks only the path from the chunk to the first hash it finds in the cache. This hash is trusted, so the processor can stop checking there. This scheme increases the verification speed but occupies on-chip cache resources. Reference [6] proposed the TEC-Tree scheme, which divides the plaintext into a data block and a nonce. The nonce blocks are protected by a hash tree (TEC-Tree). On decryption, the decrypted nonce is compared with the previously saved nonce to complete authentication. The scheme reduces the overhead of MAC computations, but the hash tree still occupies memory space. References [7] [8] proposed Bonsai Merkle Trees (BMT), an efficient Merkle-tree-based memory integrity verification technique. In BMT, the MAC of a block is computed by a hash function that takes the ciphertext and a counter as input, and a Merkle tree is built over the counters. BMT takes less space in main memory and in the L2 cache, resulting in a significant reduction in execution time. The overhead is reduced from 12.1% to 1.8% across all SPEC 2000 benchmarks.
AN INTEGRITY PROTECTION APPROACH
4.1. Introduction of Bloom Filter
A Bloom filter is a simple, space-efficient randomized data structure [19]. A Bloom filter representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of $n$ elements is described by an array of $m$ bits, initially all set to 0. It can quickly judge whether an item $y$ is a member of the set, has low overhead, and lends itself to parallel computation. The principle is as follows. A Bloom filter uses $k$ independent hash functions with range $\{1, \ldots, m\}$, and it is assumed that these hash functions map each item to a random number in that range. For each element $x \in S$, the bits to which the $k$ hash functions map $x$ are set to 1. A location can be set to 1 multiple times, but only the first change has an effect. To check whether an item $y$ is in $S$, we check whether all bits to which $y$ is mapped are equal to 1. If not, then clearly $y$ is not a member of $S$. If all the mapped bits are 1, we assume that $y$ is in $S$. This answer may be wrong with some probability; such an error is called a "false positive".
Now assume that the set changes over time, which means that elements are inserted and deleted. Inserting elements into a Bloom filter is easy, but a deletion cannot be performed by reversing the process. If an element is deleted and its corresponding bits are set to 0, we may be clearing a location to which some other element in the set is also hashed. In that case, the Bloom filter no longer correctly reflects all the elements in the set.
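The insert and membership operations can be sketched as follows. Deriving the $k$ hash functions by salting SHA-256 with an index, and indexing positions from 0 rather than 1, are assumptions made for illustration only.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: m bits and k hash functions (k < 256 in this sketch)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: bytes):
        # Assumption: the i-th hash function is SHA-256 salted with the byte i.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting an already-set bit has no further effect

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.add(b"x1")
bf.add(b"x2")
```

An inserted element is always reported as present, while an element that was never inserted is reported as present only if all of its positions happen to have been set by other elements.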
Journal of Algorithms & Computational Technology
To avoid this problem, Fan et al. [20] introduced the counting Bloom filter. In a counting Bloom filter, each entry is not a single bit but a small counter. The corresponding counters are incremented when an item is inserted and decremented when an item is deleted.
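A minimal counting variant is sketched below. The salted SHA-256 hash functions and the saturating behavior at the counter maximum are assumptions of this sketch rather than details taken from [20].

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: each entry is a small counter instead of a bit."""

    def __init__(self, m: int, k: int, counter_bits: int = 4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1
        self.counters = [0] * m

    def _positions(self, item: bytes):
        # Assumption: the i-th hash function is SHA-256 salted with the byte i.
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + item).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:  # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item: bytes) -> None:
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, item: bytes) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=256, k=3)
cbf.add(b"a")
cbf.add(b"b")
cbf.remove(b"a")  # b stays present even if it shares counters with a
```

Because a shared counter is only decremented, not cleared, deleting one element leaves the other elements that hash to the same entry intact.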
The Principle of IV-BF
We apply the Bloom filter to memory integrity protection and propose an integrity protection scheme called the Integrity Verification approach based on Bloom filter (IV-BF). Figure 2 illustrates the architecture of the IV-BF. Initially, the memory space is divided into multiple equal blocks. We assume that $n$ blocks need to be protected. We maintain an on-chip array of $m$ entries, each of which is a counter. Initially all counters are set to 0. Fan et al. [20] showed that 4 bits per counter should be sufficient for most applications. Therefore, we use 4 bits for each counter, giving a range of 0-15. We also set up $k$ independent hash functions that map any data to $[0, \ldots, m-1]$.
To establish protection, each block combined with its address is hashed by the $k$ hash functions, and the hash values are mapped to the array. Each counter that is mapped to is incremented by 1. When data in memory is updated, the counters of the old block are decremented by 1, the new block combined with its address is hashed by the $k$ hash functions, and the corresponding counters are incremented by 1. When reading data from memory, we hash the block and its address again and check whether any of the corresponding counters equals 0. If none does, we deem the data correct; otherwise we deem that the block has been tampered with. As shown in Figure 2, the 5th entry of the array has been mapped to 3 times, so its counter is 3. When modifying the second data block, we decrement the 1st, 3rd, and 5th counters by 1; the modified block is then hashed $k$ times again and the corresponding counters are incremented by 1. When checking the third block, we hash it $k$ times again. If any counter it maps to equals 0, the block must have been tampered with. We should note that, because of the inherent character of the Bloom filter, the IV-BF has a certain "false positive rate" when checking whether a block has been tampered with. However, as long as the false positive rate is sufficiently low (for instance, lower than 0.01%), it satisfies the requirement of integrity verification. The false positive rate depends on the size of the protected memory, the length of the array, and the number of hash functions. When the size of the protected memory is fixed, the false positive rate falls as the array gets longer; that is, a lower false positive rate (fewer collisions) requires a longer array. On the other hand, when the false positive rate is fixed, the array length is proportional to the size of the protected memory; that is, more protected space requires a longer array. When the memory space and the array length are fixed, increasing the number of hash functions decreases the false positive rate, up to the optimal number of hash functions [19]. We should also point out that even classical protection schemes, such as the MAC method, have a certain false positive rate.
Besides false positives, a "false alarm" may also occur when a counting Bloom filter is used to protect main memory, which means that a correct block is deemed incorrect. The reason is that when a block for which a false positive occurred is updated, counters that an intact block maps to may be decremented. The intact block is then deemed tampered when it is checked again. The false alarm is also an inherent character of the counting Bloom filter, and it is less likely than a false positive. In fact, a false alarm can be regarded as a good thing, because when it happens, at least one act of tampering has already occurred, and the system can then take countermeasures.
The IV-BF has three advantages over hash-tree-based schemes. First, it has lower computation overhead. The computation overhead depends on the number of hash computations and on the cost of each computation, which is determined by the hash function. For a given hash function, more computations therefore mean more overhead. The IV-BF can compute its hash functions in parallel because the computations are mutually independent, whereas a hash tree can only be computed level by level from the leaves up to the root. Second, it has lower space overhead. A hash tree needs a large amount of memory to store internal nodes, whereas the IV-BF needs only a small amount of on-chip memory to maintain an array and no extra off-chip memory. The IV-BF thus saves a large amount of space at the cost of an extremely low false positive rate. Third, its security strength is adjustable. When an application's data integrity requirements are extremely high, we can set a lower false positive rate at the price of relatively higher computing overhead. When the requirements are less strict, a certain false positive rate is permitted, so we can reduce the length of the array and the number of hash functions, which further decreases the computing overhead.
Rules and Algorithms of the IV-BF
To make the IV-BF work efficiently, we define some rules and algorithms. In our description of the algorithms, we use the term "chunk" for the minimum memory block that is read from and written to memory for integrity checking. If a word within a chunk is accessed by the processor, the entire chunk is brought into the processor.
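The setup, read, and update operations described in the principle of the IV-BF can be sketched as below. This is an illustration under stated assumptions, not the authors' algorithm listing: the class and method names are hypothetical, the $k$ hash functions are salted SHA-256, the counters saturate at 15, and the old chunk is verified before its counters are decremented.

```python
import hashlib

class IntegrityError(Exception):
    """Raised when a chunk read from untrusted memory fails the check."""

class IVBF:
    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.counters = [0] * m  # trusted on-chip array of 4-bit counters
        self.memory = {}         # untrusted off-chip memory: address -> chunk

    def _positions(self, address: int, chunk: bytes):
        # Assumption: hash function i is SHA-256 over (i, address, chunk).
        msg = address.to_bytes(8, "big") + chunk
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + msg).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def setup(self, address: int, chunk: bytes) -> None:
        """Establish protection: k hashes, increment the mapped counters."""
        self.memory[address] = chunk
        for pos in self._positions(address, chunk):
            self.counters[pos] = min(self.counters[pos] + 1, 15)

    def read(self, address: int) -> bytes:
        """Check a chunk: any zero counter means the chunk was tampered with."""
        chunk = self.memory[address]
        if any(self.counters[pos] == 0 for pos in self._positions(address, chunk)):
            raise IntegrityError(hex(address))
        return chunk

    def write(self, address: int, new_chunk: bytes) -> None:
        """Update a chunk: decrement for the old value, increment for the new (2k hashes)."""
        old_chunk = self.read(address)  # assumption: verify before decrementing
        for pos in self._positions(address, old_chunk):
            self.counters[pos] = max(self.counters[pos] - 1, 0)
        self.setup(address, new_chunk)

iv = IVBF(m=4096, k=4)
iv.setup(0x00, b"A" * 16)
iv.setup(0x10, b"B" * 16)
iv.write(0x00, b"C" * 16)
```

Because the address is hashed together with the data, a chunk copied to a different address maps to different counters, and a stale copy of an updated chunk maps to counters that have since been decremented.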
OVERHEAD AND SECURITY ANALYSIS OF THE IV-BF
In this section we analyze the performance of the IV-BF and compare it with the BMT, taking a binary tree as the example. The analyses of the false positive rate, the optimal number of hash functions, and the array length are based on Broder's results [19].
The IV-BF Operation Overhead
The IV-BF's operations are the setup procedure, checking data, and modifying data. The first two each perform $k$ hash computations per data block and therefore have equivalent overhead. Modifying data performs $2k$ hash computations per data block, so its overhead is twice that of the first two. The overhead of each of the first two operations is:
In (1), $a$ is the byte access time, $s$ is the memory block size, $k$ is the number of hash functions, $\bar{t}$ is the average hash computation time per byte, and $T$ is the overhead function. Simplifying (1) we get: or shown as:
From (2) we know that the operation overhead for a data block is proportional to the block size when $k$, $a$ and $\bar{t}$ are fixed.
Hash Tree Operation Overhead
For a hash tree, the overhead of operating on a data block is $E' = E_l + E_i \times d$. In the formula, $E_l$ is the leaf node overhead, $E_i$ is the internal node overhead, and $d$ is the depth of the hash tree. We have:
In the formulas, $s'$ is the memory block size, $t$ is the hash computation time per byte, $i$ is the internal node length, and $A$ is the whole memory space that needs to be protected. Therefore, the whole operation overhead is:
shown as:
The IV-BF Performance Gain Compared with Hash Tree
In general, a hash tree uses one hash function, e.g. MD5 or SHA-1, while the IV-BF uses $k$ different functions. For convenience of computation, we set $\bar{t} = t$, which means that the average per-byte computing time of the IV-BF is equal to that of the hash tree. Then, using (2) and (3), we get the speed gain:
Let $a = p \times t$, where $p$ is the ratio of $a$ to $t$. Simplifying (4), we get:
We assume that the space being protected is 4GB. The parameters of the two schemes are as follows. The IV-BF block size is 1024 bytes, the hash tree block size is 64KB, the hash tree internal node size is 16 bytes, $p$ is set to 1/4, 1/2, 1, 5, and 10 in turn, and $k$ is set to 2, 4, 6, 8, and 10. We then obtain the performance gain of the IV-BF compared with the hash tree; the results are shown in Figure 3. The horizontal axis is the number of hash functions, and the vertical axis is the speed gain.
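The shape of this comparison can be explored with a small sketch that uses the parameters listed above. The cost model in it is an assumption made for illustration and is not the paper's equations (1)-(5): a chunk of $s$ bytes is fetched once and hashed $k$ times for the IV-BF, and the binary hash tree hashes one leaf plus one internal node per level, with depth $d = \log_2(A / s')$.

```python
import math

def ivbf_overhead(s, k, a, t):
    """Assumed model: access s bytes once, hash them with k functions."""
    return s * (a + k * t)

def hash_tree_overhead(s_leaf, i, A, a, t):
    """Assumed model: one leaf plus one i-byte internal node per tree level."""
    d = math.log2(A / s_leaf)        # depth of a binary tree over A / s_leaf leaves
    leaf = s_leaf * (a + t)
    internal = i * (a + t)
    return leaf + internal * d

def gain(p, k, s=1024, s_leaf=64 * 1024, i=16, A=4 * 2**30, t=1.0):
    """Ratio of hash tree overhead to IV-BF overhead, with a = p * t."""
    a = p * t
    return hash_tree_overhead(s_leaf, i, A, a, t) / ivbf_overhead(s, k, a, t)

# Sweep the parameter grid from the text (values depend on the assumed model).
table = {(p, k): gain(p, k) for p in (0.25, 0.5, 1, 5, 10) for k in (2, 4, 6, 8, 10)}
```

Under this assumed model the ratio falls as $k$ grows and rises as $p$ grows or as the IV-BF block size shrinks; the absolute values should not be read as the values plotted in Figure 3.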
As shown in Figure 3, when $p$ is fixed the gain decreases as the number of hash functions increases, because more hash functions mean a higher overhead for each IV-BF operation and hence a lower gain. When $k$ is fixed, increasing $p$ means a longer byte access time and leads to a higher gain, because when $p$ is larger, the operation overhead of each data block is higher when verifying the hash tree. In addition, the gain increases when the data block size of the IV-BF is decreased.
The IV-BF Security Analysis
We compare the IV-BF with SHA-1. First, we need the false positive rate of SHA-1. Wang et al. [21] presented new collision search attacks on the hash function SHA-1, showing that collisions of SHA-1 can be found with complexity less than $2^{69}$ hash operations. Therefore, we can take the collision rate (false positive rate) to be $1/2^{69}$. We already know that the IV-BF has some probability of false positives. To verify data integrity, we should keep the false positive rate as low as possible. Let $m$, $n$, and $\epsilon$ denote the length of the array, the number of memory blocks being protected, and the maximum permitted false positive rate, respectively. Reference [19] proved that when $\epsilon$ and $n$ are fixed, the array size satisfies (6). From (6) we know that, to keep the false positive rate no higher than a fixed value $\epsilon$, the array length of the IV-BF must be at least $n \log_2(1/\epsilon)$. We set $n$ to 1000 and vary $m$ from $5.6 \times 10^4$ bits (6.8KB) to $6.2 \times 10^4$ bits (7.6KB); the false positive rates of the IV-BF and SHA-1 are shown in Figure 4. The horizontal axis is the length of the array, and the vertical axis is the false positive rate. From the figure we see that the IV-BF has a higher false positive rate than SHA-1 when the Bloom filter array is shorter, which means less security. However, the false positive rate of the IV-BF is very close to that of SHA-1 when the Bloom filter array is long enough. At the same time, higher security strength means a longer array and more cache occupation.
Figure 3. IV-BF gain rate compared with hash tree.
At present, SHA-1 is considered safe because its false positive rate is very low. The above analysis shows that the IV-BF can also achieve very low false positive rates. Therefore, we can say that the IV-BF is safe too.
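The relationship between array length, number of hash functions, and false positive rate can be explored with the standard Bloom filter approximations from the literature surveyed in [19]: the rate is roughly $(1 - e^{-kn/m})^k$ and is minimized near $k = (m/n)\ln 2$. The helper names below are hypothetical, and the sketch treats each array entry as one position, ignoring the 4-bit counter width.

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation for a Bloom filter with m entries, n items, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    """Number of hash functions that approximately minimizes the rate."""
    return max(1, round((m / n) * math.log(2)))

def min_array_length(n: int, eps: float) -> int:
    """Entries needed to reach rate eps when k is chosen optimally:
    m = n * log2(e) * log2(1/eps)."""
    return math.ceil(n * math.log2(math.e) * math.log2(1.0 / eps))

n = 1000
rates = {
    m: false_positive_rate(m, n, optimal_k(m, n))
    for m in (56_000, 58_000, 60_000, 62_000)
}
```

With $n$ fixed, the computed rate shrinks as $m$ grows, which is the trade-off between security strength and on-chip space discussed above.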
THE IMPROVED SCHEME OF THE IV-BF
To further decrease the usage of cache space, we propose an improved scheme of the IV-BF. As shown in Figure 4, we use a separate hash tree to protect the array of the IV-BF. Only the root of the hash tree is saved in the safe area (e.g. cache), while the other nodes of the hash tree and the Bloom filter array are saved in the untrusted area (e.g. main memory). For convenience of computation, we use 64 KB of array space as one hash computation unit.
Integrity verification in this scheme has two levels. The first level checks the integrity of the array, which is assured by the root node in the trusted area. The second level checks the integrity of memory, which is assured by the array. We again use the term chunk for the minimum memory block that is read from and written to memory for integrity checking. The detailed algorithms of the memory operations are as follows.
Initialization Operation
(i) An array is set up in main memory, and the array is divided into multiple chunks, each of which serves as a computation unit.
(ii) All bits of the array are set to 0.
(iii) k hash computations are done on each chunk combined with its address.
SIMULATION
7.1. Simulation Framework
Our simulation framework is based on a modified SimpleScalar simulator [22], which models a speculative out-of-order processor with separate address and data buses. SimpleScalar is configured to execute PISA binaries. Because the standard hash tree degrades performance significantly, the IV-BF is compared with the BMT instead. The modified simulator supports both the IV-BF and the BMT. We use the term "baseline" to refer to a standard processor without integrity verification or encryption. In the simulation, we evaluate the overhead of the memory authentication mechanisms relative to the baseline system with the same configuration.
The architectural parameters used in the simulation are shown in Table 1.

Table 1
Hash latency: 160 cycles
Hash length: 128 bits
Hash throughput: 3.2GB/s

For all the simulations in this section, eight SPEC2000 CPU benchmarks [23] are used as representative applications: vortex, vpr, art, parser, mcf, gzip, mesa and equake. These benchmarks show varied characteristics such as IPC (instructions per cycle), cache miss rates, etc. To capture the characteristics in the middle of computation, each benchmark is simulated for 100 million instructions after skipping the first 1.5 billion instructions. The major logic component needed to implement the IV-BF and the BMT is a hash (MAC) computation unit. To evaluate the cost of computing hashes, we choose SHA-1 [24] as the hashing algorithm. The core of each algorithm is an operation that takes a block and produces a 128-bit digest.
IPC impact
Integrity checking affects run-time performance mainly in three ways: IPC, cache pollution, and bandwidth consumption. First, hash computations increase system latency and degrade the IPC. Second, when hashes are cached in the L2 cache, they contend with regular application data and can worsen the L2 cache miss-rate of applications. Third, loading hashes from or storing hashes to main memory increases memory bandwidth usage and may steal bandwidth from applications.
We first compare the IPC of the IV-BF and the BMT. For convenience of comparison, both the IV-BF and the BMT use SHA-1 as the hashing algorithm and occupy 64KB of the L2 cache. The BMT caches part of the inner nodes of the counter hash tree, and the IV-BF caches the Bloom filter array. The BMT block size is set to 64B and the IV-BF block size to 16K. Figure 5 illustrates the impact of memory authentication on run-time program performance. The normalized IPC of the IV-BF and the BMT are shown. The IPC values are normalized to the baseline performance with the same configuration.
The results clearly show the advantage of the IV-BF scheme over the BMT scheme. For all cases we simulated, the IV-BF outperforms the BMT. The IV-BF scheme has as much as 28% overhead in the worst case (parser) and 7.8% on average. The BMT, on the other hand, has as much as 35% overhead in the worst case and 10.4% on average. Therefore, the IV-BF has an average performance improvement of 3.4% over the BMT. The reason is that the IV-BF can compute hashes in parallel, while the BMT can only compute hashes serially from leaf to root.
Cache Pollution
In the IV-BF, the cached array shares the L2 cache with the program executing on the processor, so the array and the application data contend for the L2 cache. This can increase the L2 miss-rate of a program and degrade performance. Figure 6 depicts the L2 miss-rate of the baseline case (Base) and the IV-BF scheme. For a 256KB L2 cache, the miss-rates of both schemes are high. Compared with Base, the L2 miss-rate of the IV-BF increases noticeably for all benchmarks (35%-150%). The reason is that the Bloom filter array contends for the L2 cache. In fact, cache contention is the major source of performance degradation for most applications. Moreover, the performance degradation decreases rapidly as the L2 cache size increases. For a 4MB L2 cache, the miss-rates of both the Base and IV-BF schemes are fairly low (0.5%-18%). None of the benchmarks shows noticeable L2 miss-rate degradation (less than 10%). The reason is that the array size is fixed, so as the L2 cache size increases, the proportion of the cache occupied by the array decreases, which alleviates the cache contention by reducing the number of hashes needed to cover a given memory space.
Throughput of Hash Computation
The throughput of computing hash functions varies depending on how the logic is pipelined. Higher throughput is better for performance but requires more area to implement. Figure 7 shows the IPC of various applications using the IV-BF with caching for varying hash throughput. As shown in the figure, increasing the hash throughput has little effect on performance once it exceeds 3.2GB/s. When the throughput drops to 1.6GB/s, which is the same as the memory bandwidth, we see minor performance degradation. When the hash throughput is lower than the memory bandwidth, it directly degrades performance, because the effective memory bandwidth is limited by the hash computing throughput. Therefore, the hash throughput should be slightly higher than the memory bandwidth.
CONCLUSION
We have presented a memory integrity verification scheme with low run-time and memory space overhead, which can be used to build high-performance secure computing platforms out of slightly modified general-purpose processors. By integrating a Bloom filter array with an on-chip cache, we arrive at a memory authentication scheme with high reliability. The performance analysis and simulation results show that the IV-BF has high efficiency and low overhead. Ongoing work includes the investigation of memory verification schemes with lower overhead, the combination of Bloom filter authentication with encryption, and the generalization of authentication schemes to SMP and DSM systems.
