Abstract-In paravirtualization, the page table management components of guest operating systems are patched to uphold the security guarantees of the hypervisor. However, none of these patches pays enough attention to performance, which results in two noticeable performance issues. First, the security patches lengthen the already long execution paths of guest page table (de)allocations, which in turn increases the latencies of process creations and exits. Second, the patches introduce many additional IOTLB flushes, leading to extra IOTLB misses that can negatively affect the I/O performance of all peripheral devices.
I. INTRODUCTION
In paravirtualization [1], [2], the operating system of each Virtual Machine (a.k.a. guest or guest domain) and the hypervisor share the same virtual address space. To prevent malicious accesses from the guest OS, the hypervisor sets the guest page tables as read-only and validates their updates to ensure that there is no runtime violation [1]. However, protection based on the guest page tables alone is not enough to defend against DMA attacks driven by a malicious guest OS [3]. To close this gap, the hypervisor resorts to I/O virtualization technology (AMD-Vi [4] or Intel VT-d [5]) and leverages an Input/Output Memory Management Unit (IOMMU) to restrict DMA accesses to the physical memory occupied by the hypervisor and the guest page tables. Specifically, the IOMMU, working like the traditional MMU, maps device addresses to physical addresses through dedicated page tables (i.e., I/O page tables in the hypervisor space). By configuring the I/O page tables exclusively, the hypervisor grants different access permissions to different physical memory pages, thus preventing malicious DMA accesses. As Xen [1] is a popular and commercial paravirtual hypervisor, we use Xen in the x86 MMU model [6] to give more details. For the physical memory pages of a guest domain, the hypervisor defines two page types: 1) the writable page, which is writable for both the guest and DMA, and 2) the non-writable page (e.g., the page-table page), which can be modified by neither of them. The hypervisor makes sure that every page of the guest has exactly one appropriate page type at any given time (e.g., page-table pages are non-writable). Thus, pages of different page types have different access permissions, which are enforced by configuring the page tables and the I/O page tables respectively. If any page-type update occurs, the hypervisor must enforce a security validation scheme.
It performs (1) the guest validation, which validates the entries of the page tables to ensure that the guest cannot write the non-writable pages through the relevant page tables, and (2) the DMA validation, which validates the entries of the I/O page tables to ensure that DMA cannot write the non-writable pages through the relevant I/O page tables. Take the page table allocation path as an example. When pages are allocated by the buddy system [7] as page-table pages, the hypervisor is involved in the path to perform the security validations. If both validations pass, the page types of the pages are updated from writable to non-writable. Likewise, when the page-table pages are deallocated, their page types are reverted and they become writable again. Note that the TLB (Translation Look-aside Buffer) and the IOTLB are used to accelerate guest and DMA address translation, respectively. When the hypervisor updates a page's type, the page's access permission changes. Hence, the hypervisor must flush the relevant TLB and IOTLB entries when processing the security validations. Otherwise, an adversary could attack the hypervisor by leveraging stale (IO)TLB entries. To enforce such protections, both the guest and the hypervisor are required to be properly patched.
[Figure 1 (a): Baseline (page table allocation). Panel labels: Existing Memory Allocators, DMA Validations.]
Problem. However, all existing patches such as [8] mainly focus on the security enhancements of the hypervisor, without paying enough attention to performance, which results in two noticeable performance issues. The first one is the long execution path of the guest page table (de)allocation, which involves a complex memory (de)allocation process and an additional security validation procedure. The memory (de)allocation process frequently involves a slab allocator [9] and page frame (de)allocations managed by the buddy system, which introduces deep invocation chains for the page table (de)allocations. Moreover, the additional security validations enforced by the hypervisor introduce extra costs for protecting the page tables from a malicious guest and DMA. All these lead to poor performance of the page table (de)allocations, which consequently results in long latencies of process creations and exits.
The other one is the additional IOTLB flushes introduced by the security validations of the page table (de)allocations. As described above, it is the access-permission updates that require the IOTLB to be flushed. Moreover, these updates are triggered throughout the lifetime of a running system. Thus, IOTLB flushes occur frequently, which inevitably increases the IOTLB miss rate and slows down DMA address translation. All these are likely to have a negative impact on the I/O performance of all peripheral devices.
Figure 1 (a) shows the baseline. To address these problems, we propose PiBooster. First, PiBooster shortens the page table (de)allocation paths with the PiBooster cache (Figure 1 (b)). The PiBooster cache queues the deallocated page-table pages (pushed into the cache), with the expectation that they (popped out of the cache) will serve page table allocations again in the near future. By doing so, page table allocations do not have to involve the costly memory management subsystem (e.g., the buddy system) every time. Instead, they can get pages directly from the cache, dramatically shortening the paths.
Second, PiBooster eliminates the additional IOTLB flushes with a fine-grained validation scheme (Figure 1 (c)), which completely separates the guest and DMA validations. In the traditional design, there are two types of guest memory pages: the writable page, which is writable for both the guest and DMA, and the non-writable page (e.g., the page-table page), which is writable for neither of them. As discussed before, both page table allocations and deallocations always involve page-type changes between the two types. Thus, the hypervisor has to perform both the guest and DMA validations to ensure that neither of them violates the security scheme. However, if we introduce a new page type (i.e., the semi-writable page) that is writable for guest access and non-writable for DMA access, and ensure that page-type changes only occur between the page-table pages and the semi-writable pages when the page tables are (de)allocated, then the hypervisor can remove the DMA validation from the page table (de)allocation paths and only perform the guest validation, because both page types are non-writable for the DMA. In addition, the semi-writable pages are naturally managed by the PiBooster cache, as they can be reserved for future page table allocations.
In summary, we make the following contributions: 1) We identified two significant performance issues in the guest page table management. In particular, we are the first, to the best of our knowledge, to identify the performance issue between page ...
Specifically, PiBooster is able to completely eliminate the additional IOTLB flushes, and effectively reduces the execution time of three-level page table allocation and deallocation by 47%. In particular, the latencies of both process creations and exits are reduced by 16% on average. Besides, the benchmark tools (i.e., SPECINT, netperf, and lmbench) show that PiBooster has no negative impact on the overall system performance.
The rest of the paper is structured as follows. We describe the system overview and implementation in Section II and Section III, respectively. In Section IV, we evaluate the performance of PiBooster, and we discuss related work in Section V. Finally, we conclude the paper in Section VI.
II. PIBOOSTER OVERVIEW
A. Design Requirements
In the design of PiBooster, we consider the following requirements. 1) Unaltered system security. The new scheme should not sacrifice system security to obtain performance benefits. No one likes to use a system with known design loopholes. 2) Compatibility with legacy applications. The new scheme should limit the modifications to the guest kernel and the hypervisor, and require no modifications to existing applications. 3) Small modifications. The new scheme should minimize the development cost on the guest kernel and the hypervisor.
B. PiBooster Architecture
Figure 2 depicts the architecture of PiBooster. It consists of a PiBooster cache in the guest kernel space and a PiBooster module in the hypervisor space.
The PiBooster cache manages the semi-writable pages and makes sure that page type updates between the writable page and the page-table page go through the semi-writable page first. Note that the page-table pages are a subset of the non-writable pages, and they are the focus of this paper. For instance, if any page table (de)allocations arise, the PiBooster cache normally satisfies the (de)allocations without invoking the complex memory subsystem, thus shortening the execution paths. It then issues hypercalls to the hypervisor (i.e., the communication channel), which performs the fine-grained security validations to update the relevant page types.
The PiBooster module enforces a fine-grained validation scheme on the update requests. Unlike the traditional security validation scheme, the PiBooster module only needs to perform the guest validation, without the DMA validation, when page table (de)allocations occur, so the additional IOTLB flushes can be completely avoided (i.e., the vanished channel).
C. Fine-Grained Validation
The fine-grained validation scheme aims to eliminate additional IOTLB flushes and reduce the execution time of the security validations. The traditional coarse-grained validation scheme performs both the guest and DMA validations, and the DMA validation not only increases the total validation time, but also introduces additional IOTLB flushes.
To address this problem without sacrificing the system security, we introduce a new page type: the semi-writable page, which is writable for the guest but non-writable for DMA. Since the page type only changes between semi-writable and non-writable when page table (de)allocations occur, and both types are already non-writable for the DMA, the hypervisor does not have to perform the DMA validation when dealing with these pages. As a consequence, the time of the whole security validation process is reduced, speeding up page table (de)allocations. Much like the management of the page-table pages, the hypervisor is only responsible for the security validations of the semi-writable pages, leaving all other management operations to the guest (i.e., the PiBooster cache). By doing so, we keep the modifications as small as possible by reusing the existing validation process (i.e., the guest validation) and page-table management subsystem, and also retain the system security.
D. PiBooster Module
The PiBooster module works in the hypervisor space, extended from the coarse-grained validation module. The first task of the PiBooster module is to add the support for the semi-writable page. Instead of adding a new data structure for marking the semi-writable pages, the PiBooster module chooses to reuse the existing one.
The second task of the PiBooster module is to perform the fine-grained security validations for all page type update requests. The main logic is to check the page type first, and then decide whether to perform one validation or both. Specifically, if the page type update occurs between the writable page and the semi-writable page, the module performs both the guest and DMA validations. However, if the page type update occurs between the semi-writable page and the page-table page, the PiBooster module only performs the guest validation, skipping the DMA validation. Note that the guest validation is always necessary, as the semi-writable pages are writable for the guest.
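The decision logic described above can be sketched as follows. This is a minimal illustration, not the Xen patch itself: the type names (`page_type`, `validation_plan`) and the function `plan_update` are hypothetical, and the sketch assumes that the DMA validation is needed exactly when the DMA write permission of the page changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page types; these names are illustrative, not Xen identifiers. */
enum page_type { PT_WRITABLE, PT_SEMI_WRITABLE, PT_PAGE_TABLE };

/* Which validations a page-type update requires. */
struct validation_plan {
    bool guest_validation;
    bool dma_validation;   /* implies I/O page table updates and IOTLB flushes */
};

/* Only writable pages are writable for DMA; semi-writable and
 * page-table pages are both non-writable for DMA. */
static inline bool dma_writable(enum page_type t)
{
    return t == PT_WRITABLE;
}

/* Assumption of this sketch: the guest validation is always performed,
 * and the DMA validation is performed only when the DMA write
 * permission differs between the old and the new page type. */
static inline struct validation_plan plan_update(enum page_type from,
                                                 enum page_type to)
{
    struct validation_plan plan = { false, false };

    assert(from != to);  /* an update must actually change the type */
    plan.guest_validation = true;
    plan.dma_validation = dma_writable(from) != dma_writable(to);
    return plan;
}
```

Under this formulation, an update between semi-writable and page-table pages yields a plan with only the guest validation, while an update between writable and semi-writable pages yields both.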
The PiBooster module also exports a new hypercall interface for the PiBooster cache to facilitate their communications. Through the new interface, the PiBooster cache could explicitly invoke the PiBooster module to perform the security validations on specific page type updates.
E. PiBooster Cache
The basic idea behind the PiBooster cache is to keep a cache of semi-writable pages available for page table (de)allocations. Without the page-oriented PiBooster cache, the kernel would spend much of its time allocating, initializing and freeing page-table pages. Although the slab allocator is similar to the PiBooster cache, we do not use it, for the following reasons. First, it is not aware of the page type or the security requirements, which would lead to unexpected crashes of the guest OS. Second, the existing size-oriented management of the slab does not distinguish the page-table pages from other pages of the same size, and customizing such a management mechanism requires high development costs, such as interface updates. Finally, adding the fine-grained validation mechanism for only one object (i.e., the page-table page) would subvert the generality of the slab allocator. For these three reasons, we build the PiBooster cache to serve the page table (de)allocations.
1) PiBooster Cache Initialization and Destruction:
PiBooster cache is enabled in the system bootup phase by default. By doing so, the page tables of all user processes are served by the PiBooster cache from the very beginning. To increase the flexibility, it also allows dynamic activation at runtime through an exported interface.
In the initialization phase, the PiBooster cache allocates a batch of pages from the existing system allocators, converts them into semi-writable pages, and maintains them in a dedicated cache list. At runtime, a page table deallocation always succeeds by efficiently pushing the deallocated pages into the PiBooster cache. The case of page table allocation is slightly more complex. Normally, the PiBooster cache can serve all page table allocation requests using the cached pages. However, in the worst case, the cached pages may not be able to satisfy the allocation requests. In such conditions, the PiBooster cache has to re-invoke the system allocators to get new pages. Fortunately, the re-invocations can be avoided by carefully setting the number of the initial pages. In fact, the re-invocations rarely occur in a workload-stable system. In our experiments, there are always several semi-writable pages in the PiBooster cache ready for page-table allocations after PiBooster has been running for a few minutes.
The PiBooster cache works in the whole lifetime of the guest by default, but an end user is able to explicitly disable it at any time through the exported interface. Once the PiBooster cache receives the disable command from the user, it will securely release all resources, for instance, asking the hypervisor to update the semi-writable pages to be writable.
2) Cache Shrinking: When the memory management daemon (e.g., kswapd) notices that the available memory is tight, it explicitly calls the exported interfaces of the PiBooster cache to free some cached pages. There are two interfaces for shrinking the cached pages. One is based on the page number: the memory management daemon specifies the number of pages that the PiBooster cache should release. The other is based on a percentage: for instance, the kernel could ask the PiBooster cache to release 50% of the cached pages.
The PiBooster cache could also automatically shrink itself through a predefined threshold. The threshold can be defined according to the absolute page number or the proportion (i.e., the number of the cached semi-writable pages over the number of the page-table pages), or a combination of them.
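The shrinking policies above can be modeled with a few counters. The following is a simplified sketch under stated assumptions: the structure `pb_cache_stats` and the functions `pb_shrink_by_count`, `pb_shrink_by_percent` and `pb_auto_shrink` are hypothetical names, and the sketch only tracks page counts. In the real system, each released page would additionally have to be converted back to a writable page through the hypervisor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the cache; only counts are modeled here. */
struct pb_cache_stats {
    size_t cached_pages;      /* semi-writable pages currently cached */
    size_t page_table_pages;  /* page-table pages currently in use */
};

/* Release up to n cached pages; returns the number actually released. */
static inline size_t pb_shrink_by_count(struct pb_cache_stats *c, size_t n)
{
    size_t released = n < c->cached_pages ? n : c->cached_pages;

    c->cached_pages -= released;
    return released;
}

/* Release the given percentage (0..100) of the cached pages. */
static inline size_t pb_shrink_by_percent(struct pb_cache_stats *c,
                                          unsigned percent)
{
    assert(percent <= 100);
    return pb_shrink_by_count(c, c->cached_pages * percent / 100);
}

/* Automatic shrinking with a combined threshold (an assumption of this
 * sketch): the cache may hold at most max_pages pages and at most
 * max_ratio_percent percent of the number of page-table pages. */
static inline size_t pb_auto_shrink(struct pb_cache_stats *c,
                                    size_t max_pages,
                                    unsigned max_ratio_percent)
{
    size_t limit = max_pages;
    size_t ratio_limit = c->page_table_pages * max_ratio_percent / 100;

    if (ratio_limit < limit)
        limit = ratio_limit;
    if (c->cached_pages <= limit)
        return 0;
    return pb_shrink_by_count(c, c->cached_pages - limit);
}
```

For example, with 176 cached pages, `pb_shrink_by_percent(&stats, 50)` releases 88 pages in this model.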
III. IMPLEMENTATION
In this section, we present the implementation details of the PiBooster module and the PiBooster cache based on Xen [1] (the hypervisor) and Linux (the guest kernel).
A. PiBooster Module
The first task of the PiBooster module is to extend the existing data structure to support the semi-writable page. Xen dictates that the data structure for labelling page types occupies 32 bits: bits 28-31 are allocated for the page type, bits 23-27 are for other purposes (e.g., bit 26 indicates whether the page has been validated), and bits 0-22 hold a reference count for the page type. The existing page types occupy all page type bits, so there is no extra bit available for the semi-writable page. Rather than introducing a new data structure, we borrow a bit from the reference count. In particular, the reference count field has 23 bits, representing the number of type references to a page as its current type. In practice, the hypervisor usually does not build so many references (i.e., 2^23 − 1) to one page. Thus, we borrow the highest bit (bit 22) to represent the semi-writable page type. As a consequence, the field still supports more than 4 million references (2^22 − 1), enough for almost all cases. Indeed, the hypervisor with PiBooster functions well in our experiments.
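The bit layout described above can be written down as masks. This is a sketch of the layout as stated in the text only. The macro and function names (`PB_*`, `pb_*`) are hypothetical and are not the identifiers used in the Xen source.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit type word as described in the text (names are hypothetical):
 *   bits 28-31: page type
 *   bits 23-27: other flags (e.g., bit 26: validated)
 *   bit  22   : semi-writable flag, borrowed from the reference count
 *   bits 0-21 : remaining type reference count
 */
#define PB_TYPE_MASK      (UINT32_C(0xF)  << 28)
#define PB_FLAGS_MASK     (UINT32_C(0x1F) << 23)
#define PB_SEMI_WRITABLE  (UINT32_C(1)    << 22)
#define PB_COUNT_MASK     (PB_SEMI_WRITABLE - 1)   /* 2^22 - 1 */

/* Returns nonzero if the borrowed bit marks the page as semi-writable. */
static inline int pb_is_semi_writable(uint32_t type_info)
{
    return (type_info & PB_SEMI_WRITABLE) != 0;
}

/* Extracts the reference count from the shortened 22-bit field. */
static inline uint32_t pb_type_count(uint32_t type_info)
{
    return type_info & PB_COUNT_MASK;
}

/* Sets or clears the semi-writable flag without touching other bits. */
static inline uint32_t pb_set_semi_writable(uint32_t type_info, int on)
{
    return on ? (type_info | PB_SEMI_WRITABLE)
              : (type_info & ~PB_SEMI_WRITABLE);
}
```

With this layout the largest representable reference count is `PB_COUNT_MASK`, i.e., 4194303, which matches the "more than 4 million" figure in the text.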
Besides that, the PiBooster module needs to patch the page type checking functions, e.g., get_page_type, to adjust the checking logic. The added checking logic is straightforward, and we only add or change 166 SLoC to achieve the whole patch. In addition, the DMA validation is skipped when the page type update occurs between the semi-writable page and the page-table page, simplifying the whole validation logic.
B. PiBooster Cache
There are three levels of guest page tables, and the PiBooster cache maintains a singly linked list for each of them. Each node of the list has a pointer to the next node and a page ID that is the base address of a cached page. Note that this address should be a physical address rather than a virtual address, because the physical address of a page is unique in the whole system but its virtual address is not constant. Thus, using the virtual address would confuse the page type tracing and the semi-writable page management. As page table allocations and deallocations could happen at any time on any core, each list has its own lock to support concurrent updates.
The PiBooster cache has two interfaces for the runtime page table allocations and deallocations. The pop interface serves the page table allocations. When this interface is invoked, the PiBooster cache fetches the top node of the corresponding list, extracts the base address of the cached page, and returns it to the caller. Correspondingly, the push interface serves the page table deallocations. In the push interface, the PiBooster cache saves the base address of the deallocated page into a node, and inserts it at the top of the list. The pop and push operations are extremely fast, as they respond to requests in constant time. In addition, the PiBooster cache also exports an interface that allows the memory management daemon to explicitly shrink the cached pages.
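A per-level list with constant-time push and pop can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the kernel code: the names `pb_node`, `pb_list`, `pb_push` and `pb_pop` are hypothetical, `malloc`/`free` stand in for the kernel's node allocation, and a C11 `atomic_flag` spinlock stands in for the kernel lock that protects each list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* One cached page: link to the next node plus the page ID. */
struct pb_node {
    struct pb_node *next;
    uint64_t page_id;   /* physical base address of the cached page */
};

/* One list per page-table level, each with its own lock.
 * Initialize with: struct pb_list l = { NULL, ATOMIC_FLAG_INIT }; */
struct pb_list {
    struct pb_node *top;
    atomic_flag lock;   /* user-space stand-in for a kernel spinlock */
};

static inline void pb_lock(struct pb_list *l)
{
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;   /* spin until the lock is free */
}

static inline void pb_unlock(struct pb_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Deallocation path: put a page on top of the list in constant time.
 * Returns 0 on success, -1 if no node could be allocated. */
static inline int pb_push(struct pb_list *l, uint64_t page_id)
{
    struct pb_node *n = malloc(sizeof(*n));

    if (!n)
        return -1;
    n->page_id = page_id;
    pb_lock(l);
    n->next = l->top;
    l->top = n;
    pb_unlock(l);
    return 0;
}

/* Allocation path: take the top page in constant time.
 * Returns 1 and stores the page ID on a hit, 0 on a cache miss. */
static inline int pb_pop(struct pb_list *l, uint64_t *page_id)
{
    struct pb_node *n;

    pb_lock(l);
    n = l->top;
    if (n)
        l->top = n->next;
    pb_unlock(l);
    if (!n)
        return 0;
    *page_id = n->page_id;
    free(n);
    return 1;
}
```

On a miss (`pb_pop` returning 0), a caller would fall back to the existing memory allocators, as in the traditional path.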
The PiBooster cache adds several virtual files in sysfs [10], [11], a virtual file system provided by the Linux kernel. By using the virtual files, the end user can not only send commands to the PiBooster cache, but also query and configure its internal status. In the current implementation, we only support one command, which activates and deactivates the cache service on demand. In addition, the end user can read and write the virtual files to query the number of the cached pages and to dynamically shrink the cache size, respectively.
The PiBooster cache mediates all page type updates related to the semi-writable pages. For each page type update, the PiBooster cache would issue the hypercall exported by the PiBooster module to explicitly inform the hypervisor to perform the security validations. In certain cases, such as shrinking the cached semi-writable pages to writable pages, the PiBooster cache could submit a batch of requests in one hypercall. In such cases, the hypervisor would update the I/O page tables and flush IOTLBs at one time, reducing the time of cache shrinking.
IV. EVALUATION
We have implemented a prototype of PiBooster on our experiment platform. Xen version 4.2.1 is the hypervisor, while the guest VM (i.e., Dom0) is Ubuntu version 12.04 with Linux kernel version 3.2.0. PiBooster adds or changes 350 SLoC in the Linux kernel and 166 SLoC in Xen. To fully evaluate its performance and its effects on the whole system, we measured PiBooster using self-developed tiny tools and selected benchmarks (i.e., SPECINT, netperf and lmbench).
A. Experiment Setting
The experiment platform is a LENOVO QiTian M4390 PC with four CPU cores (i.e., Intel Core i5-3470) running at 3.20 GHz. We enable the Intel VT-d feature through the BIOS and the grub configuration file.
Workload Emulation. To repeatedly measure the effects of PiBooster on 1) page table allocations and deallocations, and 2) the IOTLB flushes, we use a stress tool to explicitly emulate a heavy workload with many short-lived and concurrently running processes. Specifically, the tool periodically launches a browser (i.e., Mozilla Firefox 31.0 in the experiment), continuously opens new tabs one by one, and terminates the browser gracefully. The purpose of these operations is to frequently create and terminate a large number of processes, leading to many page table allocations and deallocations. The frequency can be configured. In our experiment setting, 542 processes are created and exited per minute. To prevent the browser from occupying too much memory, we terminate it every 5 minutes. At that moment, the memory usage of the browser reaches 284.1 MB on average.
B. Tiny-tool Measurements
The tiny-tool tests evaluate the frequency of the IOTLB flushes, the execution time of the page table (de)allocations and the memory usage of the PiBooster cache. For each measurement, there are two control groups and one baseline/normal group. In the baseline group, we run the workload emulation in the guest with default settings, without enabling the PiBooster mechanism. In contrast, the two control groups are: 1) the Pre-PiBooster group, where PiBooster is enabled before the workload emulation starts, and 2) the Dyn-PiBooster group, where PiBooster is dynamically enabled (e.g., five minutes after the workload emulation launches). The Dyn-PiBooster group evaluates 1) whether PiBooster is able to enter a stable state as in the Pre-PiBooster group, and 2) how fast PiBooster is able to enter the stable state.
Figure 3: The frequency of IOTLB flushes. In the Pre-PiBooster group, the frequency is reduced from a very low level to zero within one minute. In the Dyn-PiBooster group, the frequency drops sharply within two minutes from the high level to zero. Both control groups indicate that PiBooster could always enter the stable state.
1) Frequency of IOTLB Flushes:
In this test, we aim to evaluate the effectiveness of the fine-grained validation for reducing the additional IOTLB flushes. We sample the frequency of the IOTLB flushes in 30 minutes. The measurement results are illustrated in Figure 3 . In the baseline group, the frequency of the IOTLB flushes is increasing in the first five minutes, and then keeps at a high flush rate (i.e., 9050 flushes on average per minute) until the test completes. In the Pre-PiBooster group, the flush frequency quickly decreases from a low level (i.e., 332 flushes for the first minute) to zero level in about one minute and keeps at the level. In the Dyn-PiBooster group, the flush frequency sharply decreases to zero when the PiBooster is enabled. The PiBooster roughly spends two minutes entering the stable state. Thus, we can conclude that the fine-grained validation scheme is able to efficiently and effectively eliminate the IOTLB flushes introduced by the DMA validation.
2) Execution Time Measurement: There are three levels of guest page tables, and each level has its own (de)allocation functions, e.g., pgd_alloc and pgd_free for the L1 (top) level. We continuously measure all three levels over 30 minutes, and calculate the average execution time of each function in each minute. Note that in the Dyn-PiBooster group, we enable PiBooster 5 minutes after the workload starts to run. That is why the measurements of the Dyn-PiBooster group in the first 5 minutes are almost the same as the baseline results.
As shown in Figure 4, the average execution time in the Pre-PiBooster group, for a pair of allocation and deallocation of three-level page tables, is 1961 ns, while in the baseline group the corresponding time is 3687 ns. Thus, the time reduction of the page table allocation and deallocation is roughly 47%. The results also indicate that PiBooster in the Dyn-PiBooster group and the Pre-PiBooster group achieves the same performance improvements. The only difference is that PiBooster in the Dyn-PiBooster group needs a transitional period of 2 minutes to become stable.
The Worst Case. When PiBooster starts to work, the guest kernel always first invokes the PiBooster cache for the page table allocations. If the cached semi-writable pages cannot satisfy the requirements (a.k.a. a cache miss), the PiBooster cache has to allocate writable pages from the existing memory allocators following the traditional paths. In this case, the execution path is the traditional execution path plus the path in the PiBooster cache. As a result, the execution time of the page table allocation is even longer than that of the baseline (Figure 1 (a)). However, the overhead introduced by the PiBooster path is negligible, as the control flow returns immediately when the cache is empty. More specifically, the overhead consists of the function invocation, the stack adjustments and the checking logic of both the cache list and the page type. Altogether, the overhead is less than 20 instructions. Fortunately, the worst case does not occur often. According to our observations in the Dyn-PiBooster group, the worst case occurs 348 times out of 198990 allocation requests (about 0.17%) in 30 minutes. Note that the page-table deallocation always succeeds in constant time.
Table I: Cache usages in both groups of Pre-PiBooster and Dyn-PiBooster are small, occupying 176 pages and 160 pages, respectively.
3) Memory Usage Measurement:
To clearly observe the page usage in each level, we measure the number of the cached pages in the three levels, and the results are listed in Table I. In the Pre-PiBooster group, the PiBooster cache has 5, 26 and 145 semi-writable pages ready for allocations, occupying 704 KB in total. This is quite similar to the Dyn-PiBooster group, where the cache has 160 pages in total, occupying 640 KB. As a result, the PiBooster cache in every group consumes an insignificant amount of memory, always less than 1 MB.
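As a quick check of these totals, the following worked computation assumes the standard 4 KB x86 page size (the page size is an assumption here, as it is not stated explicitly in the text):

```latex
\begin{align*}
\text{Pre-PiBooster:}\quad (5 + 26 + 145)\ \text{pages} \times 4\,\text{KB}
  &= 176 \times 4\,\text{KB} = 704\,\text{KB} \\
\text{Dyn-PiBooster:}\quad 160\ \text{pages} \times 4\,\text{KB}
  &= 640\,\text{KB}
\end{align*}
```

Both values are below 1 MB (1024 KB), consistent with the reported memory usage.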
C. Selected Benchmarks
The purpose of the benchmarks is to evaluate the effects of PiBooster on the overall system. All measurements are divided into two groups: 1) the baseline group, and 2) the PiBooster group.
1) Latency of Process Creations and Exits:
Lmbench is used to measure the latencies of process creations and exits (i.e., fork+exit, fork+execve, fork+/bin/sh -c), shown in Table II . PiBooster group reduces time cost by 11%, 
2) SPECINT:
The measurement results of 12 programs in SPECINT2006 [12] are listed in Table III. They show that the PiBooster group improves performance by only 0.04%, indicating that PiBooster at least has no negative effect on the system.
3) I/O Performance Measurements: The IOTLB is used to accelerate DMA address translation to achieve better performance for I/O devices. Therefore, if there are many IOTLB misses caused by frequent IOTLB flushes, they will have negative effects on the I/O performance. However, rIOMMU [13] claims that the overhead of walking the IOMMU page tables due to IOTLB misses is so negligible that it cannot be measured with netperf, because the main latency induced by I/O interrupt processing and the TCP/IP stack is several orders of magnitude larger than that of walking the page tables. In addition, Amit et al. [14] make similar statements.
In this paper, we test the network I/O and disk I/O performance under regular circumstances, and use these experiments to revisit the relationship between the IOTLB misses and I/O performance. We evaluate the network using netperf, with the PiBooster system running as the netperf client that sends TCP packets. The measurement results show that the throughput is improved by only 0.02% in the PiBooster group, meaning that there is no obvious improvement in network I/O. Similarly, we test the disk I/O using lmbench. The results indicate that the disk I/O speed remains the same.
The above two experiments provide new evidence supporting the observations in [14], [13]. In fact, the effects of the IOTLB misses are measurable by using an extremely high-speed I/O device, e.g., Intel's I/O Acceleration Technology [15], or a newly designed IOMMU. For instance, the rIOMMU project [13] ...
V. RELATED WORK
... [3], [18]. To fix this gap, the hypervisor has to enable the I/O virtualization (AMD-Vi [4] or Intel VT-d [5]) technology, preventing any DMA access to the guest page table [8]. In this paper, we keep the page-table based security and improve the performance of the page table management.
IOTLB Misses Reduction. There are some existing approaches [14], [13], [19] that analyze and reduce the negative effects of the IOTLB misses. Amit et al. [14] first analyze the role of the IOTLB in DMA operations and quantify the performance overhead of IOTLB misses. Then they present new strategies, with both software and hardware enhancements, to reduce the IOTLB miss rate in order to facilitate DMA address translation. rIOMMU [13] redesigns the architecture of the IOMMU to achieve high performance in DMA transactions, during which the IOTLB misses are also largely reduced. Willmann et al. [19] propose new strategies for Xen to reconfigure the addressing mode of the IOMMU, resulting in fewer IOTLB misses. Different from the previous approaches that attempt to reduce the overall IOTLB misses, our approach mainly focuses on eliminating the additional IOTLB misses introduced by the security validation (i.e., the DMA validation).
VI. CONCLUSION
The paravirtual guest OS has two important performance issues in page table management: 1) the long execution paths of page table (de)allocations and 2) the additional IOTLB flushes introduced by the DMA validation. In this paper, we proposed the PiBooster system to address the above problems. We shortened the execution paths of the page table (de)allocations by introducing the PiBooster cache. We introduced a fine-grained validation scheme to eliminate the additional IOTLB flushes and further shorten the execution paths. We implemented a prototype of PiBooster and fully evaluated its performance effects. The evaluation results indicated that PiBooster could completely eliminate the additional IOTLB flushes in workload-stable environments, and effectively reduced the (de)allocation time of the page table by 47% on average. Also, the latencies of process creations and exits were reduced by 16% on average, without negative performance impacts on the system. We believe that PiBooster can effectively reduce the overall I/O performance overhead in high-speed settings by eliminating the additional IOTLB misses, and we plan to rely on Intel's I/O Acceleration Technology [15] to conduct such experiments in the near future. Besides, we are going to evaluate the effectiveness of PiBooster in a simulated cloud environment where Xen runs on several multi-core machines, hosting multiple virtual machines.
