Abstract-Cycle-level microarchitectural simulation is the de facto standard for estimating the performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through microarchitectural state checkpointing, one can quickly and easily determine the performance characteristics of a workload for a variety of microarchitectural changes.
I. INTRODUCTION
Execution-driven, detailed simulation is trusted for its accuracy as it can be made as faithful to hardware as desired. Unfortunately, the trade-off for this level of detail is a very slow simulation speed: from 10s to 100s of KIPS for many cycle-level simulators. A turning point in simulation speed occurred when researchers realized that, similar to trace-driven simulation of the past, we can use sampling in execution-driven simulation as well [18], [23]. When sampling applications, there is no longer a need to simulate the execution of multi-billion instruction workloads in detail when we can now find and simulate just the representative slices. Finding these representative slices, either by recognizing different phases in programs [18], or by selecting them in a statistically robust way [23], is largely a solved problem. Further, offline, we can checkpoint the state of the simulated machine just before these slices, eliminating most other simulation overhead apart from the detailed simulation itself. Although we may still see advances in this area, the 1000-fold speedups obtained by the combination of offline checkpointing and sampling execution-driven simulation have undeniably revolutionized how we perform architectural research. Using these techniques, full-system simulations have become practical.
Footnote 1: Andreas Sandberg is now with ARM Research.
It is natural that one is inclined to view further increases in simulation speed as welcome, but not revolutionary. However, this is only true if one is content to simulate the same software frozen in a set of checkpoints, over and over (presumably while changing some architectural parameters) without the ability to make even the slightest software change. Modern software, alas, is not like this:
Table I: Performance of pjbb2005:1x400000 using two Oracle JavaVM versions and two cache sizes.
allowed to change in different simulation runs is further restricted, reducing the overall flexibility to explore design spaces.
To regain such flexibility one needs to take a step back, give up checkpointing and revert to fast forwarding (e.g., functional simulation) to locate representative samples. This is a significant setback, not only because it slows down all simulation runs, but also because it requires the functional warm-up of cache structures.
A. Contributions
To address the growing number of situations where frozen checkpoints are insufficient, we need a new approach to provide a fast, yet flexible, simulation infrastructure. This work, Full Speed Ahead (FSA) and parallel FSA (pFSA), can help address this problem. We recognize that the new challenge in simulation is to allow changes, both to the software and hardware configurations, while maintaining simulation speeds previously realized through sampling (See Figure 1) . We achieve this by:
1) Fast-forwarding at near-native speed through virtualization to eliminate the need for stored checkpoints.
2) Providing a low-overhead warming error estimation to bound the error due to insufficient cache warming.
3) Parallelizing the warmup and detailed simulation of samples and hiding their execution behind fast-forwarding at near-native speed.
In this paper we describe how our approach enables high performance simulation without the limitations of frozen software/hardware checkpoints, and evaluate both the resulting performance as well as warmup error. To motivate this work, we first provide a use case that cannot be examined in a reasonable time frame with prior simulation methodologies.
B. The case for fast simulation without checkpoints
Checkpointed sampling methodologies, while instrumental in fast, accurate computer architecture research, have a number of limitations that hold back true hardware/software co-design studies. The following two examples demonstrate the close interaction of hardware and software parameters, and the need for a simulation methodology that enables rapid evaluation of such changes. This is particularly important for research as investigators often do not know ahead of time what simulations they need to perform, making the latency of individual experiments key to productivity.
a) Example 1: JavaVM version comparison: Given a new microarchitectural feature, researchers will typically sweep potential hardware parameters to determine the benefits of a new technique. However, not taking into account the impact of different JVM versions might skew the results, as shown in Table I. Using the FSA methodology allows us to quickly switch software versions without the need to generate additional checkpoints.
We can see in Table I that for this benchmark, moving from Java 6 to Java 7 gives us the same performance improvement as moving this Java 6 workload from a 2 MB to an 8 MB cache. Also, using the newer version of the JVM shows that with the same percentage cache size increase, we see a smaller relative performance increase with the newer Java version.
With this simple experiment, assuming a 1 MIPS functional warming rate, it would take us 42 hours of functional warmup to generate the checkpoints needed for this experiment (along with another few hours of detailed simulation time). In just 2 hours with pFSA, we are able to generate all of the results needed for this comparison. While these benchmarks are relatively small, the complete full-system simulation of large-scale hardware-software optimizations now becomes possible.
b) Example 2: Data and instruction cache size evaluation: Java programs rely on a garbage collector to periodically scan and remove old, unused data from the heap. This behavior can be controlled by adjusting the runtime maximum heap size parameter. A logical next step for this evaluation would be to see the effect on garbage collection.
There is a potential for complex interactions between the garbage collector and the microarchitecture, which encourages us to ask the questions: What maximum heap size should we use for our microarchitectural experiments, and how much does it affect the overall performance? Initially, the question points us to a data cache experiment, where we sweep the maximum size of the heap together with the size of the last-level cache to see their effect on application performance. In Figure 2a, we see that performance is affected more by the size of the L2 data cache when the JVM has a smaller maximum heap size (64MB vs. 512MB). The performance improves by 21.9% for the larger, 512MB heap size, but we see a much larger improvement (34.3%) for the smaller, 64MB heap size. This leads us to conclude that a larger heap size reduces both garbage collections and cache pollution, improving performance.
However, we would also like to investigate if other parameters, such as the instruction cache, will affect application performance as we vary the maximum heap size. Our hypothesis is that a higher frequency of garbage collections could cause instruction cache pollution in a similar fashion. Figure 2b shows a different result, with 64 MB and 512 MB heaps improving at approximately the same rate, 11.8% and 9.9% respectively, as we vary the L1 instruction cache size.
With a traditional sampling methodology, these experiments would require the generation of 48 different checkpoints, which would take more than 19 days of execution time for the generation of the checkpoints alone. With pFSA, each experiment took just 10 minutes, for a total of 8 hours on one machine for the complete set of simulation runs 2 . Fast, low-latency simulation enables researchers to experiment with new directions quickly and easily, evaluate realistic software configurations across different software versions and configurations, and conduct hardware-software co-design studies that were not possible before.
II. FSA SAMPLING METHODOLOGY
Sampled simulation can significantly reduce the amount of time needed for simulation by using a smaller but representative sample of an application [4], [18], [21], [22], [23]. The resulting detailed sample contains a very small percentage (<1%) of the total number of application instructions, providing the potential for a large simulation speedup. However, to maintain an accurate microarchitecture state between sampling units, especially with regards to caches, a significant number of instructions between sampling units need to be simulated. As a result, the overall simulation time is dominated by this continuous cache warming, while the actual detailed simulation is a relatively small portion of the total simulation time. To overcome this limitation, earlier proposals address the cache warming problem by either functionally warming caches between samples (which provides similar accuracy to detailed cache warming with higher performance), or using checkpoints of microarchitectural state to quickly start simulation at pre-determined points. Functional simulation between samples (either with or without functional cache warming) is still extremely slow compared to native execution, and checkpointing is limited in scope, as each checkpoint is locked to one specific software configuration and cache hierarchy.
In this work, the Full Speed Ahead methodology eliminates the slowdown due to functional simulation between checkpoints by using near-native speed virtualized execution. However, as virtualized execution cannot warm the simulated caches, we add a separate functional cache warming period before each detailed simulation. The key to this approach is that we provide a bound on the error due to cache warming, which allows us to avoid the performance penalty of always-on cache warming while maintaining accuracy. This solution is ideal for hardware-software co-design where there is the need to regularly change both the software and hardware configurations for each simulation.
A. FSA: Hardware-Accelerated Fast-Forwarding and Sampling
Our sampling simulation methodology uses four different modes of execution to balance accuracy and simulation overhead. The first mode, virtualized fast-forward, uses hardware virtualization to advance between samples. This mode executes the vast majority of the instructions in the simulated system using hardware virtualization. The second mode, functional warming, is the fastest simulation mode and executes instructions without simulating timing (functional execution), but simulates caches and branch predictors to ensure that they are in a representative, warm, state. The third mode, detailed warming, simulates the entire system in detail using an out-of-order CPU model without sampling any statistics, to ensure that pipeline structures with short-lived state (e.g., load and store buffers) are in a warm state. The fourth mode, detailed simulation, simulates the system in detail and takes the desired measurements. The interleaving of these simulation modes is shown in Figure 3b.
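The interleaving of the four modes within one sample period can be sketched as follows. This is an illustrative sketch only, not the gem5 implementation: the names `Mode` and `sample_schedule` are hypothetical, and we assume that the three simulated modes run back to back at the end of each period.

```python
from enum import Enum


class Mode(Enum):
    VIRTUALIZED_FAST_FORWARD = "virtualized fast-forward"
    FUNCTIONAL_WARMING = "functional warming"
    DETAILED_WARMING = "detailed warming"
    DETAILED_SIMULATION = "detailed simulation"


def sample_schedule(period, functional_warming, detailed_warming, detailed_sim):
    """Return the (mode, instruction count) sequence for one sample period.

    Assumption: whatever part of the period is not spent in a simulated
    mode is covered by virtualized fast-forwarding.
    """
    simulated = functional_warming + detailed_warming + detailed_sim
    if simulated > period:
        raise ValueError("simulated modes do not fit in the sample period")
    return [
        (Mode.VIRTUALIZED_FAST_FORWARD, period - simulated),
        (Mode.FUNCTIONAL_WARMING, functional_warming),
        (Mode.DETAILED_WARMING, detailed_warming),
        (Mode.DETAILED_SIMULATION, detailed_sim),
    ]
```

Only the last entry of the schedule contributes measurements; the two warming entries exist solely to put caches, branch predictors, and pipeline structures into a representative state.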
Our simulation approach is similar to the SMARTS [23] methodology (Figure 3a), but instead of guaranteeing perfect cache warmup, we provide an estimation of the warmup error that occurs during a shorter functional warming phase. In practice, SMARTS in gem5 [3] executes around 1 MIPS (limited by functional simulation), while FSA simulates common system configurations at between 90 and 600 MIPS across the SPEC CPU2006 benchmarks. While we cannot guarantee that caches are warm when we take a sample, we can rapidly estimate the error incurred by limited functional warming and re-warm with a longer interval if needed. This enables users to tune the warming to get the desired performance-accuracy tradeoff. In the extreme, this methodology will revert to SMARTS-like always-on cache warming for applications which require it, although we have never seen this need in our extensive simulations.
B. Estimating Warmup Error
To enable the use of our virtualized fast-forward mode, we need to ensure that our microarchitectural state, such as the large last-level cache and branch predictors, has been sufficiently warmed before switching to detailed simulation. Intuitively, a cache is warm when all cache blocks are populated with valid data or the working set of an application is resident in the cache. However, there is no upper bound for how long it takes to guarantee a warm cache state.
As the virtualized fast-forwarding mode cannot warm the simulated state, we have developed a method to estimate the error introduced through limited functional warming before detailed simulation. Instead of focusing on how warm the cache state is before starting a detailed simulation, one could simply ask what impact an insufficiently warmed cache has on the simulated results. If this impact is too large, the experiment can be rerun with additional cache warming. In Section V we will show that this approach of measuring the impact of cache warming can lead to an overall simulation improvement of at least two orders of magnitude compared with the traditional SMARTS always-on warming approach. With such a performance increase, the impact of rerunning a few detailed sample simulations due to insufficient cache warming is negligible.
We can answer the question of what impact an insufficiently warmed cache has on performance by running two simple simulation experiments: an optimistic simulation and a pessimistic simulation (Figure 3d). For the optimistic simulation, a cache miss to a cold cache set is treated as a cache hit (assuming that the data would have been there had the cache been sufficiently warmed), while for the pessimistic simulation, a cache miss to a cold cache set is treated as a real cache miss. Typically, simulating the same execution twice is prohibitively slow. However, if both experiments are based on the same cache warming simulation, which dominates the simulation time, the overhead of simulating both experiments is very small.
Our framework allows for simulation state to be replicated efficiently using copy-on-write semantics. This enables us to use the same warmed cache state for both experiments with very low overhead. Since the bulk of the simulation time is spent in cache warming, doubling the detailed simulation only adds 3.9% simulation overhead overall.
C. pFSA: Exploiting Sample-Level Parallelism
Despite executing the majority of the instructions natively through virtualized fast-forwarding, FSA still spends the majority of its time in the non-virtualized simulation modes (typically 75%-95%) to warm and measure sample points. This means that we can reach new sample points much faster than we can warm and simulate a single sample point, which exposes parallelism between samples. Through fast copying of the simulator state for detailed simulation and fast-forwarding to the next sample, we enable parallel sample discovery and detailed simulation. We call this simulation mode Parallel Full Speed Ahead (pFSA) sampling. pFSA has the same execution modes as FSA, but, unlike FSA, the functional and detailed simulation of previous samples execute in parallel while the virtualized fast-forwarding runs ahead to generate the next sample (Figure 3c). Our parallel simulator scales well, indicating that the overhead of state copying is insignificant for the short functional and detailed simulation in each sample (see Section V-B).
D. Summary
Our proposal achieves simulation speeds comparable to native speed (63% of native on average), without being bound by the rigid nature of checkpointing (frozen software and cache state) or continuous (slow) functional warmup. This means that we can simulate much faster than before while retaining the flexibility to change the software and hardware at will. We believe that this flexibility is becoming increasingly important for systems-level researchers, as changing either hardware or software at will is impractical with either checkpointing or full functional warmup.
However, without cache warming during our virtualized fast-forward, we lose the ability to switch immediately to detailed warming without functional cache warming. As fast-forwarding approaches near-native speed, functional cache warming again dominates simulation time. This requires significant simulation time to warm the caches, which we hide by executing in parallel while we fast-forward to the next simulation point. To ensure that we have warmed the caches sufficiently, we have developed a low-overhead (3.9%) approach that allows us to rapidly measure and compensate for warming errors.
III. BACKGROUND

A. gem5: Full-System Discrete Event Simulation
Full-system simulators are important tools in computer architecture research as they allow architects to model the performance impact of new features on the whole computer system, including the operating system and I/O devices. To accurately simulate the behavior of a system, they must simulate all important components in the system, including CPUs, the memory system, and the I/O and the storage.
The gem5 [3] simulator is a cycle-level full-system microarchitectural simulator, which provides modules for most components in a modern system. In this paper, we extend gem5 to add support for hardware virtualization through a new virtual CPU module. The virtual CPU module can be used as a drop-in replacement for other CPU modules in gem5, thereby enabling rapid fast-forwarding in our virtualized fast-forwarding mode. Since the module supports the same gem5 interfaces as simulated gem5 CPU modules, it can be used for checkpointing and CPU module switching during simulation.
B. Hardware Virtualization
Virtualization solutions have traditionally been employed to run multiple operating system instances on the same hardware. Nevertheless, the goals of virtualization software and traditional computer architecture simulators are very different. One of the major differences is how device models (e.g., disk controllers) are implemented. Traditional virtualization solutions typically prioritize performance, while architecture simulators focus on accurate timing and detailed hardware statistics. Timing-sensitive components in virtual machines typically follow the real-time clock in the host, which means that they follow wall-clock time rather than a simulated time base. Integrating support for hardware virtualization into a simulator such as gem5 requires one to ensure that the virtual machine and the simulator have a consistent view of devices, time, memory, and CPU state.
IV. PFSA IMPLEMENTATION DETAILS
In this section, we describe the details of our sampling methodology, including virtualization, parallel simulation of representative regions, and efficient handling of cache warming. This infrastructure has been committed into the master branch of the gem5 repository at http://gem5.org. While our implementation is gem5-specific, we believe that the techniques used are portable to other simulation environments.
A. Hardware Virtualization
Our goal is to accelerate simulation by off-loading the vast majority of the instructions executed in the simulated system to the hardware CPU. This is accomplished by our virtual CPU module using hardware virtualization extensions to execute code natively at near-native speed. We designed the virtual CPU module to allow it to work as a drop-in replacement for the other CPU modules in gem5 (e.g., the OoO CPU module) and to only require standard features in Linux. This means that it supports gem5 features like CPU module switching during simulation and runs on off-the-shelf Linux distributions.
Integrating hardware virtualization in a discrete event simulator requires that we ensure consistent handling of a number of different aspects of the system: the simulated devices, time, memory, and the processor state. Each of these issues is discussed in detail below.
Consistent Devices: We interface the virtual CPU with gem5's device models (e.g., disk controllers, displays, etc.), which allows the virtual CPU to use the same devices as the simulated CPUs. Memory accesses to IO devices and IO instructions are intercepted by the virtualization layer, which stops the virtual CPU and hands over control to gem5. In gem5, we synthesize a memory access that is inserted into the simulated memory system, allowing the access to be seen and handled by gem5's device models.
Consistent Time: Simulating time can be challenging because device models (e.g., timers) execute in simulated time, while the virtual CPU executes in real time. Traditional virtualization environments solve this issue by running device models in real time as well. We address the difference in timing requirements between the virtual CPU and the gem5 device models by restricting the amount of time the virtual CPU is allowed to execute between simulator events.
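The idea of bounding virtual CPU execution by the next simulator event can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not gem5 code: `run_bounded`, `event_ticks`, and the `run_virtual_cpu(budget)` callback are hypothetical names, and we assume the callback reports how many ticks of simulated time it consumed (it may stop early, e.g., when an IO access is intercepted).

```python
import heapq


def run_bounded(event_ticks, end_tick, run_virtual_cpu):
    """Interleave a virtual CPU with events scheduled in simulated time.

    event_ticks: ticks at which device events are due (hypothetical input).
    run_virtual_cpu(budget): runs for at most `budget` ticks and returns
    the number of ticks actually consumed (between 1 and budget).
    Returns a trace of ("cpu", tick) and ("event", tick) entries.
    """
    queue = list(event_ticks)
    heapq.heapify(queue)
    now = 0
    trace = []
    while True:
        # Service every event that is due at the current simulated time.
        while queue and queue[0] <= now:
            trace.append(("event", heapq.heappop(queue)))
        if now >= end_tick:
            break
        # Never let the virtual CPU run past the next pending event.
        limit = min(queue[0], end_tick) if queue else end_tick
        budget = limit - now
        used = run_virtual_cpu(budget)
        if not 0 < used <= budget:
            raise ValueError("virtual CPU must consume 1..budget ticks")
        now += used
        trace.append(("cpu", now))
    return trace
```

Because the budget is recomputed after every exit from the virtual CPU, device models never observe the CPU running ahead of an event they have scheduled.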
Consistent Memory: Interfacing between the simulated memory system and the virtualization layer is necessary to transfer state between the virtual CPU module and the simulated CPU modules. The memory regions must map properly between the simulator and the host, and therefore we set up the virtual machine to correspond to the memory mappings used by gem5. Additionally, care must be taken to write back and invalidate all simulated caches when switching to the virtual CPU to maintain a consistent state. Finally, accesses to memory-mapped IO devices need to be trapped and simulated by gem5.
Consistent State: There are some differences in how gem5 and the real hardware handle processor state (for example, in gem5, the x86 flag register is split across several internal registers to allow more efficient dependency tracking in the OoO pipeline model). We have therefore implemented state conversion routines to give gem5 access to the processor state using the same APIs as the simulated CPU modules. This enables online switching between virtual and simulated CPU modules as well as simulator checkpointing and restarting.
B. Fast Simulation State Cloning
Exposing the parallelism available in a sampling simulator requires us to be able to overlap the simulation of multiple samples. When taking a new sample, the simulator needs to be able to start a new worker task that executes the detailed simulation using a copy of the simulator state at the time the sample was taken. Copying the state to the worker can be challenging since the state of the system (registers and RAM) can be large. There are methods to limit the amount of state the worker needs to copy [21], but these can complicate the handling of misspeculation. We chose to leverage the host operating system's copy-on-write (CoW) functionality to provide each sample with its own copy of the full system state. This CoW functionality is also used to implement the pessimistic and optimistic warming analysis discussed in Section II-B. A small difference between the optimistic and pessimistic cases indicates that the sample has been sufficiently warmed and can be used for detailed simulation.
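On a POSIX host, copy-on-write cloning of this kind is what `fork` provides. The following is a minimal sketch of the pattern, assuming a Unix-like host; `simulate_in_clone`, `collect`, and the `detailed_sim` callback are hypothetical names for illustration and are not taken from the gem5 implementation.

```python
import os
import pickle


def simulate_in_clone(state, detailed_sim):
    """Fork a worker that runs detailed_sim(state) on a CoW snapshot.

    The parent returns immediately with (pid, read_fd) and is free to keep
    fast-forwarding and mutating its own copy of `state`.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: sees the state exactly as it was at the time of the fork.
        os.close(read_fd)
        try:
            result = detailed_sim(state)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(result, out)
        finally:
            os._exit(0)  # skip the parent's cleanup handlers
    os.close(write_fd)
    return pid, read_fd


def collect(pid, read_fd):
    """Read the worker's result and reap the worker process."""
    with os.fdopen(read_fd, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    return result
```

The host kernel copies a memory page only when either process writes to it, so the cost of a clone is proportional to the state that is modified afterwards rather than to the total size of the simulated system.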
We currently support bounding cache warmup error, where the optimistic and pessimistic cases differ in the way we treat warming misses, i.e. misses that occur in sets that have not been fully warmed. In the optimistic case, we assume all warming misses are hits (i.e., sufficient warming). This overestimates the performance of the simulated cache since some of the hits might have been capacity misses. In the pessimistic case, we assume that warming misses are actual misses (i.e., worst-case for insufficient warming). This may underestimate the performance of the simulated cache as some of the misses might have been hits had the cache been fully warmed. Together, these pessimistic and optimistic detailed simulations show the performance impact we would experience from insufficient warming. If these results differ significantly from the regular simulation, we can then adjust the warming time and re-simulate.
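The classification of warming misses can be illustrated with a small set-associative LRU cache model. This is a sketch under stated assumptions rather than the gem5 cache model: the class name `WarmingAwareCache` is hypothetical, and we assume that a set counts as fully warmed once all of its ways hold valid blocks.

```python
from collections import OrderedDict


class WarmingAwareCache:
    """Set-associative LRU cache that separates warming misses from real misses."""

    def __init__(self, num_sets, ways, block_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.warming_misses = 0
        self.real_misses = 0

    def access(self, addr):
        block = addr // self.block_size
        cache_set = self.sets[block % self.num_sets]
        tag = block // self.num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return "hit"
        if len(cache_set) < self.ways:
            # Assumption: a set with invalid ways has not been fully warmed.
            self.warming_misses += 1
            outcome = "warming_miss"
        else:
            cache_set.popitem(last=False)  # evict the least recently used block
            self.real_misses += 1
            outcome = "miss"
        cache_set[tag] = True
        return outcome

    def hit_rate_bounds(self):
        """Return (pessimistic, optimistic) hit rates."""
        total = self.hits + self.warming_misses + self.real_misses
        if total == 0:
            return (0.0, 0.0)
        pessimistic = self.hits / total
        optimistic = (self.hits + self.warming_misses) / total
        return (pessimistic, optimistic)
```

In this simplified model the two cases differ only in how warming misses are counted; in a timing simulation they would additionally differ in the latency charged for each such access.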
V. EVALUATION
To evaluate pFSA we investigate two key characteristics: accuracy of sampled simulation and performance. To evaluate the accuracy of our proposed sampling scheme, we compare the results of a traditional, non-sampling, reference simulation of the first 30 billion instructions of the benchmarks to sampling using a gem5-based SMARTS implementation and pFSA. We show that pFSA can estimate the IPC of the simulated applications with an average error of 2.0%. To investigate sources of the error, we investigate the impact of cache warming on accuracy. Finally, we evaluate scalability in a separate experiment where we show that our parallel sampling method scales almost linearly up to 28 cores.
For our experiments we simulated a 64-bit x86 system (Debian Wheezy with Linux 3.2.44) with split 2-way 64 kB L1 instruction and data caches and a unified 8-way 2MB or 8MB L2 cache with a stride prefetcher. The simulated CPU uses gem5's OoO CPU model. See Table II for a summary of the important simulation parameters. We compiled all benchmarks with GCC 4.6 in 64-bit mode with x87 code generation disabled 3 . We evaluated the system using the SPEC CPU2006 benchmark suite with the reference data set. All simulation runs were started from the same checkpoint of a booted system. Simulation execution rates are shown running on an 8-core 2.3 GHz Intel Xeon E5520.
SMARTS, FSA, and pFSA all use a common set of parameters to control how much time is spent in their different execution modes. In all sampling techniques, we executed 30,000 instructions in the detailed warming mode and 20,000 instructions in the detailed simulation mode. The length of detailed warming was chosen according to the method in the original SMARTS work [23] and ensures that the OoO pipeline is warm. Functional warming for FSA and pFSA was determined heuristically to require 5 million and 25 million instructions for the 2 MB L2 cache and 8 MB L2 cache configurations, respectively. Methods to automatically select appropriate functional warming have been proposed [10] , [19] by other authors and we outline a method leveraging our warming error estimates in the future work section. Using these parameters, we took 1000 samples per benchmark. Due to the slow reference simulations, we limit accuracy studies to the first 30 billion instructions from each of the benchmarks, which corresponds to roughly a week's worth of simulation time in the OoO reference. For these cases, the sample period was adjusted to ensure 1000 samples in the first 30 billion instructions.
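To make these parameters concrete, the following sketch computes how the instructions of one sample period are divided in the accuracy study. The function name `sample_budget` is hypothetical, and we assume that functional warming, detailed warming, and detailed simulation all fall inside the sample period, with the remainder covered by virtualized fast-forwarding.

```python
def sample_budget(total_insts, num_samples, functional_warming,
                  detailed_warming=30_000, detailed_sim=20_000):
    """Split one sample period into virtualized and simulated instructions."""
    period = total_insts // num_samples
    simulated = functional_warming + detailed_warming + detailed_sim
    if simulated > period:
        raise ValueError("simulated modes do not fit in the sample period")
    return {
        "period": period,
        "virtualized": period - simulated,
        "virtualized_fraction": (period - simulated) / period,
        "measured_fraction": detailed_sim / period,
    }


# Accuracy study: 1000 samples in the first 30 billion instructions.
small_l2 = sample_budget(30_000_000_000, 1000, functional_warming=5_000_000)
large_l2 = sample_budget(30_000_000_000, 1000, functional_warming=25_000_000)
```

Under these assumptions the sample period is 30 million instructions, of which 24.95 million are fast-forwarded for the 2 MB configuration and 4.95 million for the 8 MB configuration. These figures apply only to the shortened accuracy runs; with the same number of samples spread over a complete benchmark, the period is longer and the virtualized share correspondingly larger.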
A. Accuracy
The degree of accuracy needed depends on what questions the user is asking. In many cases, especially when sampling, accuracy can be traded off for performance. The selected sampling parameters strike a balance between accuracy and performance when estimating the average CPI of an application.
All sampling methodologies that employ functional warming suffer from two main sources of errors: sampling errors and inaccurate warming. Our SMARTS and pFSA experiments have been set up to sample at the same instruction counts, which implies that they should suffer from the same sampling error 4 . Functional warming incurs small variations in the access streams seen by branch predictors and caches since it does not include effects of speculation or reordering. This can lead to a small error, which has been shown [23] to be in the region of 2%. The error incurred by these factors is the baseline SMARTS error, which in our experiments is 1.87% for a 2 MB L2 cache and 1.18% for an 8 MB L2 cache.
Another source of error is the limited functional warming of branch predictors and caches in FSA and pFSA. In general, our method provides very similar results to our gem5-based SMARTS implementation (see Figure 4). However, there are a few cases (e.g., 456.hmmer) when simulating a 2 MB cache where we did not apply enough warming. A large estimated warming error (as seen in Figure 4a for 456.hmmer) generally indicates that a benchmark should have had more functional warming applied. If the error is too high, the simulation can be rerun with a longer warming period. The two completed simulations will still run approximately two orders of magnitude faster than the gem5-based SMARTS simulation.
To better understand how warming affects the predicted IPC bound, we simulated two benchmarks with different warming behaviors (456.hmmer & 471.omnetpp) with different amounts of cache warming. Figure 5 shows how the estimated IPC error due to warming shrinks as the amount of cache warming increases. While 471.omnetpp only requires two million instructions to reach an estimated warming error less than 1%, 456.hmmer requires more than 10 million instructions to reach the same goal. This analysis shows how we are able to accurately bound the amount of warmup needed, which then allows us to increase warming to reduce error as needed.
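One way to act on the estimated warming error is to lengthen functional warming until the optimistic and pessimistic results agree closely enough. The loop below is a hedged sketch of that idea, not the procedure used in our experiments: `warm_until_bounded` and the `estimate_ipc_bounds` callback are hypothetical, and the relative gap between the two IPC estimates is only one possible error metric.

```python
def warm_until_bounded(estimate_ipc_bounds, start_warming,
                       max_error=0.01, max_warming=1_000_000_000, growth=2):
    """Increase functional warming until the estimated warming error is small.

    estimate_ipc_bounds(warming) must return (pessimistic_ipc, optimistic_ipc)
    for a run that uses `warming` instructions of functional warming.
    Returns (warming, error) for the first setting that meets max_error,
    or for max_warming if the target is never met.
    """
    warming = start_warming
    while True:
        pessimistic, optimistic = estimate_ipc_bounds(warming)
        if pessimistic <= 0:
            raise ValueError("pessimistic IPC must be positive")
        # Assumed metric: gap between the bounds, relative to the lower bound.
        error = (optimistic - pessimistic) / pessimistic
        if error <= max_error or warming >= max_warming:
            return warming, error
        warming = min(warming * growth, max_warming)
```

Since each rerun costs a small fraction of an always-on warming run, a geometric search such as this one needs only a handful of iterations to reach a warming length that covers several orders of magnitude.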
B. Performance & Scalability
As discussed in Section I, to enable hardware-software codesign in simulation, sampling methodologies using frozen microarchitectural checkpoints are no longer sufficient. To address this, FSA has returned to functional simulation, using virtualized execution to speed up fast-forwarding and parallel simulation to accelerate functional warming and detailed simulation. Figure 6 compares the execution rates of native execution, virtual fast-forwarding (VFF), FSA, and pFSA when simulating a system with a 2MB and 8MB last-level cache. The reported performance of pFSA does not include warming error estimation, which adds 3.9% overhead on average. The achieved simulation rate of pFSA depends on three factors. First, fast-forwarding using VFF runs at near-native (90% on average) speed, which means that the simulation rate of an application is limited by its native execution rate regardless of parallelization. Second, each sample incurs a constant cost: the longer a benchmark is, the lower the average overhead. Third, large caches need more functional warming, and the longer the functional warming, the greater the cost of the sample. As seen when comparing the average simulation rates for a 2 MB cache and an 8 MB cache, simulating a system with larger caches incurs a larger overhead.
The difference in functional warming length results in different simulation rates for 2MB and 8MB caches. While the 8MB cache simulation is slower to simulate than the smaller cache, there is also more parallelism available. Looking at the simulation rate when simulating a 2MB cache as a function of the number of threads used by the simulator (Figure 7 ) for a fast (416.gamess) and a slow (471.omnetpp) application, we see that both applications scale almost linearly until they reach 93% and 45% of native speed respectively. The larger cache on the other hand starts off at a lower simulation rate and scales linearly until all cores in the host system are occupied. We estimate the overhead of copying simulation state (Fork Max) by removing the simulation work in the child and keeping the child process alive to force the parent process to do CoW while fast-forwarding. This is an estimate of the speed limit imposed by parallelization overheads. In order to understand how pFSA scales on larger systems, we ran the scaling experiment on a 4-socket Intel Xeon E5-4650 with a total of 32 cores. We limited this study to the 8MB cache since simulating a 2MB cache reached near-native speed with only 8 cores. As seen in Figure 8 , both 416.gamess and 471.omnetpp scale almost linearly until they reach their maximum simulation rate, peaking at 84% and 48.8% of native speed, respectively. 
VI. RELATED WORK
Our parallel sampling methodology builds on ideas from three different simulation and modeling techniques: virtualization, sampling, and parallel profiling. We extend and combine ideas from these areas to form a fully-functional, efficient, and scalable full-system simulator using the well-known gem5 [3] simulation framework.
A. Hardware Virtualization
There have been several earlier attempts at using virtualization for full-system simulation. Rosenblum et al. pioneered the use of virtualization-like techniques with SimOS [15], which ran a slightly modified version of Irix as a UNIX process and simulated privileged instructions in software. PTLsim [24], by Yourst, used para-virtualization to run the target system natively. Due to the use of para-virtualization, PTLsim requires the simulated operating system to be aware of the simulator. The simulated system must therefore use a special para-virtualization interface to access page tables and certain low-level hardware. This also means that PTLsim does not simulate low-level components like timers and storage components (disks, disk controllers, etc.). Both SimOS and PTLsim use a fast virtualized mode for fast-forwarding and support detailed processor performance models. The main difference between them and gem5 with our virtual CPU module is the support for running unmodified guest operating systems. Additionally, since we only depend on KVM, our system can be deployed with unmodified host operating systems.
An interesting new approach to virtualization was taken by Ryckbosch et al. in their VSim [16] proposal. This simulator mainly focuses on IO modeling for cloud-based workloads. Their approach employs time dilation to simulate slower or faster CPUs by making interrupts happen more or less frequently relative to the instruction stream. Since the system lacks a detailed CPU model, there are no facilities for detailed simulation or auto-calibration of the time dilation factor. In many ways, the goals of VSim and pFSA are very different: VSim focuses on fast modeling of large IO-bound workloads, while pFSA focuses on sampling of detailed micro-architecture simulation.
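The idea behind time dilation can be shown with a short calculation. This is only an illustrative sketch and not VSim's implementation; the function and parameter names are hypothetical. A simulated CPU that is `speed_factor` times as fast as the baseline should execute proportionally more instructions between two timer interrupts, so interrupts become rarer relative to the instruction stream.

```python
def insts_between_interrupts(baseline_ips, timer_hz, speed_factor):
    """Instructions to execute between two timer interrupts.

    baseline_ips: instructions per second of the baseline CPU (assumed).
    timer_hz:     timer interrupt frequency seen by the guest.
    speed_factor: >1 models a faster CPU, <1 a slower one.
    """
    if baseline_ips <= 0 or timer_hz <= 0 or speed_factor <= 0:
        raise ValueError("all arguments must be positive")
    return round(baseline_ips / timer_hz * speed_factor)
```

For example, with a baseline of one billion instructions per second and a 1 kHz timer, a factor of 2.0 doubles the number of instructions between interrupts and a factor of 0.5 halves it.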
B. Software/Hardware Emulation
An alternative to hardware virtualization is software emulation. One popular emulation platform is QEMU, which allows full-system emulation in a user-space process and is the basis of the MARSS [13] simulator. pFSA's hardware virtualization is dramatically faster than software emulation, and can achieve simulation speeds of 2,000 MIPS compared to 0.2 MIPS for the QEMU-based MARSS simulator. While it would be possible for QEMU-based simulators to use hardware virtualization, without a means to bound warmup error, as pFSA does, they would see little benefit due to the need for continuous warming.
The overhead of functional simulation was targeted with the ProtoFlex FPGA simulation platform [5]. Unlike pFSA, FPGA-based approaches do not enable easy parallel execution of sample points through CoW state replication and are harder to customize than software-based approaches.
C. Sampling
Techniques for sampling simulation have been proposed many times before [23], [21], [4], [22], [18], [7], [1]. The two main techniques are SimPoint [18] and SMARTS [23]. While both are based on sampling, SimPoint takes a very different approach from SMARTS and pFSA: it builds on checkpoints of representative regions of an application.
Such regions are automatically detected by finding phases of stable behavior. In order to speed up SMARTS, Wenisch et al. proposed TurboSMARTS [21], which uses compressed checkpoints that include cache state and branch predictor state. The primary drawback of these approaches is that their checkpoints represent frozen software and hardware configurations. This makes them particularly unsuitable for applications such as hardware-software co-design or operating system development. Since pFSA uses virtualization instead of checkpoints to fast-forward between samples, there is no need to perform costly simulations to regenerate checkpoints when making changes in the simulated system. SMARTS has the nice property of providing statistical error bounds for sampling accuracy. These bounds assure users who strictly follow the SMARTS methodology that their sampled IPC will be within a certain degree of accuracy with a particular confidence. Since we do not perform always-on cache and branch predictor warming, we cannot provide the same guarantees, but we achieve similar accuracy in practice. To identify problems with insufficient warming, we have proposed a low-overhead approach that can estimate the warming error, and our approach makes re-simulating with increased warmup relatively fast.
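The kind of statistical bound referred to above can be sketched with standard sampling theory. The code below is a generic illustration using a normal approximation, not the SMARTS implementation; the function names and the default `z` value (1.96, roughly 95% confidence) are assumptions for the example. It covers sampling error only and says nothing about warming error.

```python
import math
import statistics


def sampled_mean_ci(values, z=1.96):
    """Return (mean, half_width) of a confidence interval for the mean
    of per-sample measurements, using a normal approximation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, z * std_err


def required_samples(cv, rel_error, z=1.96):
    """Samples needed so that the relative half-width is at most
    rel_error, given the coefficient of variation cv of the metric."""
    return math.ceil((z * cv / rel_error) ** 2)
```

Given per-sample measurements, `sampled_mean_ci` reports how tightly the sampled mean is bounded, and `required_samples` indicates how many samples a target bound would need for a metric with a given variability.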
The sampling approach most similar to FSA is the one used in COTSon [1] by HP Labs. COTSon combines AMD SimNow [2] (a JITing functional x86 simulator) with a set of performance models for disks, networks, and CPUs. The simulator uses a dynamic sampling strategy [7] that uses online phase detection to exploit phases of execution in the target. Since the functional simulator they use cannot warm microarchitectural state, they employ a two-phase warming strategy similar to FSA. However, unlike FSA, they do not use hardware virtualization to fast-forward execution; instead, they rely on much slower functional simulation (10x overhead [1], compared to 10% using virtualization).
D. Parallel Simulation
There have been many approaches to parallelizing simulators. We use a coarse-grained high-level approach in which we exploit parallelism between samples. A similar approach was taken in SuperPin [20] and Shadow Profiling [12], which both use Pin [9] to profile user-space applications and run multiple parts of the application in parallel. Shadow Profiling aims to generate detailed application profiles for profile guided compiler optimizations, while SuperPin is a general-purpose API for parallel profiling in the Pin instrumentation engine. Our approach to parallelization draws inspiration from these two works and uses parallelism to overlap detailed simulation of multiple samples with native execution. The biggest difference is that we apply the technique to full-system simulation instead of user-space profiling.
Another approach to parallelization is to parallelize the core of the simulator. Parallel discrete event simulation (PDES) [8] techniques, as seen in the Wisconsin Wind Tunnel [14] and Graphite [11] simulators, as well as ZSim [17], which simulates applications in two interleaved phases, are orthogonal to the pFSA methodology and could be combined with it to expose additional parallelism.
VII. FUTURE WORK
Extending this work to multicore simulation is the obvious next step now that we have developed a technique to do extremely fast single-core simulation. Many of the existing approaches for handling synchronization between simulated parallel threads (e.g., Graphite, Sniper, etc.) are directly applicable to this work, although pFSA's ability to generate checkpoints and re-simulate very cheaply opens up new directions for even more efficient and accurate multicore simulation.
We are also looking into ways of extending warming error estimation to TLBs and branch predictors. An interesting application of warming estimation is to quickly profile applications to automatically detect per-application warming settings that meet a given warming error constraint. Additionally, an online implementation of dynamic cache warming could use feedback from previous samples to adjust the functional warming length on the fly and use our efficient state copying mechanism to roll back samples with too short functional warming.
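One possible shape for such a feedback loop is sketched below. This is a speculative illustration of the future-work idea, not an existing feature: `choose_warming_length` and the `estimate_error` callback (standing in for re-running a sample with a given functional warming length and returning its estimated warming error) are hypothetical, and the doubling policy is an arbitrary choice.

```python
def choose_warming_length(estimate_error, start_insts, max_insts, max_error):
    """Double the functional warming length until the estimated warming
    error meets the constraint or the length cap is reached.

    Returns (warming_length, estimated_error) for the last attempt.
    """
    if start_insts <= 0 or max_insts < start_insts:
        raise ValueError("need 0 < start_insts <= max_insts")
    warm = start_insts
    while True:
        # Hypothetical: roll back and re-simulate the sample with `warm`.
        err = estimate_error(warm)
        if err <= max_error or warm >= max_insts:
            return warm, err
        warm = min(2 * warm, max_insts)
```

The returned length could then seed the next sample, so that only samples whose warming turns out to be too short pay the cost of a rollback.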
VIII. SUMMARY
In this work we have addressed the need to move away from frozen software and architectural checkpoints to enable rapid simulation evaluations involving both hardware and software modifications. Our methodology provides the flexibility and speed needed for hardware/software co-design, a necessary next step given the complexity and dynamic nature of current software, through the combination of fast simulation, parallel execution, and the novel bounding of cache error.
We have evaluated this methodology with the SPEC CPU 2006 benchmark suite, and have demonstrated high accuracy (IPC error of 2.2% and 1.9% when simulating 2MB and 8MB L2 caches respectively) and near-native performance (63% or 25% of native depending on cache size). Compared to detailed simulation, our parallel sampling simulator results in 7,000x-19,000x speedup.
