Architectural simulation is extremely time-consuming given the huge number of instructions that need to be simulated for contemporary benchmarks. Sampled simulation, which selects a number of samples from the complete benchmark execution, yields substantial speedups. However, there is one major issue that needs to be dealt with in order to minimize non-sampling bias, namely the hardware state at the beginning of each sample. This is well known in the literature as the cold-start problem. The hardware structures that suffer the most from the cold-start problem are cache hierarchies. In this paper, we propose NSL-BLRL, which combines two previously proposed cache hierarchy warmup approaches, namely no-state-loss (NSL) and boundary line reuse latency (BLRL). The idea of NSL-BLRL is to warm up the cache hierarchy using a hardware state checkpoint that stores a truncated NSL stream. The NSL stream is a least-recently used stream of (unique) memory references in the pre-sample. This NSL stream is then truncated to form the NSL-BLRL warmup checkpoint; this is done by inspecting the sample to determine how far back in the pre-sample one needs to go to accurately warm up the hardware state for the given sample. We show using SPEC CPU2000 benchmarks that NSL-BLRL (i) is nearly as accurate as BLRL and NSL for sampled processor simulation, (ii) yields simulation time speedups of several orders of magnitude compared to BLRL and (iii) is more space-efficient than NSL. As such, we conclude that NSL-BLRL is a highly efficient and accurate cache warmup strategy for sampled processor simulation.
INTRODUCTION
Current microarchitectural research and microprocessor development relies heavily on cycle-level architectural simulations. Cycle-level simulations model a microarchitecture at a fairly detailed level while executing real-life applications. The price paid for such detailed simulations of real-life benchmarks obviously is simulation speed. Simulating a full benchmark execution can take days or even weeks for completion. If we take into account that during microarchitectural research and microprocessor development a multitude of design alternatives need to be evaluated, we easily end up with months or even years of simulation. As such, detailed simulation of full benchmark executions during design space exploration is infeasible.
Several approaches have been proposed in the recent literature to address this issue. One particular proposal is sampled simulation [1-9]. Sampled simulation selects a number of execution intervals from a complete benchmark execution, called samples, to be simulated. Since the number of samples and their sizes are limited, significant simulation speedups are obtained. However, there is one particular issue that needs to be dealt with, namely the cold-start problem. The cold-start problem refers to the unknown hardware state at the beginning of each sample. An attractive solution to the cold-start problem is to simulate a number of instructions from the pre-sample without computing performance metrics. The pre-sample is the set of contiguous instructions before the sample, i.e. from the end of the previous sample until the beginning of the current sample. The purpose is to warm up large hardware structures so that the hardware state at the beginning of the sample is a close estimate of the state a detailed simulation would have reached at that point had the full benchmark been simulated.
THE COMPUTER JOURNAL, 2007
© The Author 2007. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
For Permissions, please email: journals.permissions@oxfordjournals.org doi:10.1093/comjnl/bxm061
The Computer Journal Advance Access published September 4, 2007
Owing to the extremely long history in microarchitectural state (as is the case for large caches), the warmup phase needs to be proportionally long. Since the warmup phase can be a significant part of the total sampled simulation time, it is important to study efficient but accurate warmup strategies. Reducing the warmup length can yield significant simulation speedups. Several warmup proposals have been made in the literature. No-state-loss (NSL) [1], memory reference reuse latency (MRRL) [10], boundary line reuse latency (BLRL) [11], memory hierarchy state (MHS) [12] and the TurboSMARTS' live-points approach [13] are the most accurate and flexible approaches existing today. This paper shows that NSL can be combined with BLRL into an efficient warmup strategy called NSL-BLRL for warming cache hierarchies in sampled processor simulation. NSL-BLRL outperforms previously proposed warmup strategies. The NSL-BLRL approach that we propose is more than two orders of magnitude more efficient in terms of warmup memory references compared to the best performing contiguous warmup approaches such as BLRL while achieving the same accuracy. Compared to NSL, NSL-BLRL requires 30% less disk storage. This is an important issue when numerous samples along with their warmup info need to be stored on disk; the total disk space requirement might become very large. In addition, limiting the size of the warmup info stored on disk also reduces simulation time; reading the warmup info from disk and transferring it over a network for the parallel simulation of the various samples on a cluster of machines can be done substantially faster. Compared to MHS and TurboSMARTS' live-points, NSL-BLRL features the advantage of being independent of the cache line size. Whenever another cache line size is of interest for MHS and TurboSMARTS' live-points during design space exploration, the warmup needs to be recomputed. This is not the case for NSL-BLRL.
This paper is organized as follows. We first revisit sampled processor simulation in Section 2, after which we discuss previously proposed cache warmup strategies in Section 3. Section 4 then details our newly proposed cache warmup strategy, called NSL-BLRL. After describing our experimental setup in Section 5, we evaluate NSL-BLRL in Section 6. Finally, we conclude in Section 7.
SAMPLED PROCESSOR SIMULATION
In sampled processor simulation, a number of samples are chosen from a complete benchmark execution (see Fig. 1). The instructions between two samples are called the pre-sample. Sampled simulation only uses the instructions in the sample to report performance results; instructions in the pre-sample are not considered.
There are basically two issues with sampling. The first issue is the selection of representative samples. The problem is to select samples in such a way that the sampled execution is an accurate picture of the complete execution of the program. As such, it is important not to limit the selection of samples to the initialization phase of the program execution. This is a manifestation of the more general observation that a program goes through various phases of execution and that the sampling should reflect this notion. In other words, samples should be chosen in such a way that all major phases are represented in the sampled execution. Several approaches have been described in the recent literature to select such samples: random sampling by Conte et al. [1] , profile-driven sampling by scaling the basic block execution counts by Dubey and Nair [2] , selecting basic blocks with representative context information using the R-metric by Iyengar et al. [3, 4] , periodic selection as done in SMARTS [9] , selection based on clustering similarly behaving intervals as done by Lafage and Seznec [5] as well as in SimPoint [6] [7] [8] .
The second issue next to the selection of representative samples is the correct hardware state at the beginning of each sample. This is well known in the literature as the cold-start problem. At the beginning of a sample, the correct hardware state is unknown since the microarchitecture state is not simulated for instructions in the pre-sample in case of sampled processor simulation. Several techniques have been proposed in the literature to address this important issue (see Section 3 for a more detailed discussion on existing warmup strategies).
Most of these use a number of instructions preceding the sample to warm up hardware state before each sample. Under such a warmup strategy, sampled simulation consists of three steps (see also Fig. 1). The first step is cold simulation in which the program execution is fast-forwarded, i.e. functional simulation without updating microarchitectural state. The second step is warm simulation, which updates microarchitectural state. This is typically done for large hardware structures such as caches, translation lookaside buffers (TLBs), branch predictors, etc. Under warm simulation, no performance metrics are calculated. It is important to note that the warm simulation phase can be very long since microarchitectural state can have an extremely long history. The third step is hot simulation, which includes detailed processor simulation while computing performance metrics, e.g. calculating cache and branch predictor miss rates, number of instructions retired per cycle, etc. These three steps are repeated for each sample. Note that sampled simulation can be applied to both trace-driven and execution-driven simulation. Trace-driven simulation consumes the instructions from a trace stored on disk. Execution-driven simulation consumes a binary along with an input and emulates the program execution while simulating. In case of trace-driven simulation, the instructions under cold simulation are discarded from the trace, i.e. they need not be stored on disk. The warm and hot simulation instructions, though, need to be stored on disk. For sampled execution-driven simulation, we need to make a distinction between the fast-forwarding and the architecture state checkpointing approaches. Under fast-forwarding, the cold simulation instructions are functionally simulated (instructions are emulated one at a time) and the warm simulation instructions are simulated using specialized functional simulators that warm up the hardware state.
Under architecture state checkpointing, there are no cold simulation instructions to be simulated. The architectural state is loaded from the architectural state checkpoint. This architectural state checkpoint guarantees a correct architectural state for both the register and the memory state. Recent work has proposed efficient architectural state checkpointing techniques (see, e.g. [12] [13] [14] ). Once the architectural state is loaded, warm simulation starts through specialized functional simulation or through hardware state checkpointing as will be discussed in the following section. Checkpointed sampling is especially attractive for the parallel simulation of the various samples on a cluster of machines [15] [16] [17] . The various samples along with their checkpoints can be distributed across the cluster for parallel simulation.
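The cold, warm and hot steps described above can be pictured as a schedule over the dynamic instruction stream. The following Python sketch is purely illustrative: the function name `phase_schedule`, its parameters and the use of one fixed warmup length for every sample are assumptions made for this example and do not correspond to any simulator discussed in this paper.

```python
from typing import List, Tuple

def phase_schedule(sample_starts: List[int], sample_len: int,
                   warm_len: int) -> List[Tuple[str, int, int]]:
    """Return (phase, first_instruction, end_exclusive) triples.

    Hypothetical helper: sample_starts are instruction indices in
    increasing order, and every sample uses the same warmup length.
    """
    schedule = []
    pos = 0  # first instruction that has not been scheduled yet
    for start in sample_starts:
        # Warm simulation cannot begin before the previous sample ended.
        warm_start = max(pos, start - warm_len)
        if warm_start > pos:
            schedule.append(("cold", pos, warm_start))   # fast-forward
        if start > warm_start:
            schedule.append(("warm", warm_start, start))  # update state only
        schedule.append(("hot", start, start + sample_len))  # measure
        pos = start + sample_len
    return schedule
```

With samples starting at instructions 100 and 300, a sample length of 10 and a warmup length of 50, this sketch fast-forwards over instructions 0 to 49 and 110 to 249, warms over 50 to 99 and 250 to 299, and measures only inside the two samples.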
WARMUP STRATEGIES
A number of warmup strategies have been proposed, which we revisit in this section. We first detail a number of previously proposed warmup strategies that have fallen into disfavor because they are not as accurate or efficient as the ones that will be discussed in Sections 3.1 -3.3.
(i) The cold or no warmup scheme [18, 19] Haskins and Skadron [21] determines the warmup length as follows. First, the user specifies the desired probability that the cache state at the beginning of the sample under warmup equals the cache state under perfect warmup. Second, the MSE formulas are used to determine how many unique references are required during warmup. Third, using a memory reference profile of the pre-sample it is calculated where exactly in the pre-sample the warmup should get started in order to cover these unique references.
We now discuss a number of warmup strategies, which are fairly accurate and efficient, namely: MRRL, BLRL and a number of hardware state checkpointing techniques including NSL.
CACHE WARMUP FOR SAMPLED PROCESSOR SIMULATION THROUGH NSL -BLRL Page 3 of 15

Memory reference reuse latency
Haskins and Skadron [10] propose MRRL for accurately warming up hardware state at the beginning of each sample. As suggested, MRRL refers to the number of instructions between consecutive references to the same memory location, i.e. the number of instructions between a reference to address A and the next reference to A. For their purpose, they divide the pre-sample/sample pair into $N_B$ non-overlapping buckets each containing $L_B$ contiguous instructions; in other words, the total pre-sample/sample pair consists of $N_B \cdot L_B$ instructions (see also Fig. 2). The buckets receive an index from 0 to $N_B - 1$ in which index 0 is the first bucket in the pre-sample. The first $N_{B,P}$ buckets constitute the pre-sample and the remaining $N_{B,S}$ buckets constitute the sample; obviously, $N_B = N_{B,P} + N_{B,S}$.
The MRRL warmup strategy also maintains $N_B$ counters $c_i$ ($0 \le i < N_B$). These counters $c_i$ will be used to build the histogram of MRRLs. Through profiling, the MRRL is calculated for each reference and the associated counter is updated accordingly. For example, for a bucket size $L_B = 10\,000$ (as is used by Haskins and Skadron [10]) an MRRL of 124 534 will increment counter $c_{12}$. When the complete pre-sample/sample pair is profiled, the MRRL histogram $p_i$, $0 \le i < N_B$, is computed. This is done by dividing the bucket counters by the total number of references in the pre-sample/sample pair, i.e. $p_i = c_i / \sum_{j=0}^{N_B - 1} c_j$.
Not surprisingly, the largest $p_i$'s are observed for small values of $i$ due to the notion of temporal locality in computer program address streams. Using the histogram $p_i$, Haskins and Skadron calculate the bucket corresponding to a given percentile K%, i.e. bucket $k$ for which $\sum_{m=0}^{k-1} p_m < K\%$ and $\sum_{m=0}^{k} p_m \ge K\%$. This means that of all the references in the current pre-sample/sample pair, K% have a reuse latency that is smaller than $k \cdot L_B$. As such, Haskins and Skadron define these $k$ buckets as their warmup buckets. In other words, warm simulation is started $k \cdot L_B$ instructions before the sample.
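The MRRL profiling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Haskins and Skadron: the function names are hypothetical, the input is assumed to be a list of (instruction index, address) pairs covering the pre-sample/sample pair, and a reference with no earlier reference to the same address is assumed not to contribute to the histogram. The percentile test uses integer arithmetic to avoid rounding effects.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

def mrrl_counters(refs: Iterable[Tuple[int, Hashable]], bucket_size: int,
                  num_buckets: int) -> List[int]:
    """Count reuse latencies per bucket (the counters c_i of the text)."""
    last_seen: Dict[Hashable, int] = {}
    counters = [0] * num_buckets
    for instr_idx, addr in refs:
        prev = last_seen.get(addr)
        if prev is not None:
            latency = instr_idx - prev  # instructions since previous reference
            # Clamp defensively so an out-of-range latency cannot overflow.
            counters[min(latency // bucket_size, num_buckets - 1)] += 1
        last_seen[addr] = instr_idx
    return counters

def percentile_bucket(counters: List[int], k_percent: float) -> int:
    """Smallest bucket k whose cumulative share of counts reaches K%."""
    total = sum(counters)
    cumulative = 0
    for k, count in enumerate(counters):
        cumulative += count
        if total and cumulative * 100 >= k_percent * total:
            return k
    # No counted references, or K% above 100: fall back to the extremes.
    return 0 if total == 0 else len(counters) - 1
```

Following the convention of the text, warm simulation would then start `percentile_bucket(...) * bucket_size` instructions before the sample. With a bucket size of 10 000, a reuse latency of 124 534 lands in counter 12, matching the example given above.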
An important disadvantage of MRRL is that a mismatch in the MRRL behavior in the pre-sample versus the sample might result in a suboptimal warmup strategy in which the warmup is either too short to be accurate or too long for the attained level of accuracy. For example, if the reuse latencies are generally larger in the sample than in the pre-sample/sample pair, the warmup will be too short and, as a consequence, the accuracy might be poor. Conversely, if reuse latencies are generally shorter in the sample than in the pre-sample/sample pair, the warmup will be too long for the attained level of accuracy. One way of solving this problem is to choose the percentile K% large enough. The result is that the warmup will be longer than needed for the attained accuracy.
Boundary line reuse latency
BLRL [11, 22] is quite different from MRRL although it is also based on reuse latencies. In BLRL, the sample is scanned for reuse latencies that cross the pre-sample/sample boundary line, i.e. a memory location is referenced in the pre-sample and the next reference to the same memory location is in the sample. For each of these cross-boundary-line reuse latencies, the pre-sample reuse latency is calculated. This is done by subtracting the distance in the sample from the MRRL. For example, if instruction $i$ has a cross-boundary-line reuse latency $x$, the pre-sample reuse latency is then $x - (i - N_{B,P} \cdot L_B)$ (see Fig. 3). A histogram is built up using these pre-sample reuse latencies. As is the case for MRRL, BLRL uses $N_{B,P}$ buckets of size $L_B$ to limit the size of the histogram. This histogram is then normalized to the number of reuse latencies crossing the pre-sample/sample boundary line. The required warmup length is then computed to include a given percentile K% of all reuse latencies that cross the pre-sample/sample boundary line.
There are three key differences between BLRL and MRRL. First, BLRL considers reuse latencies for memory references originating from instructions in the sample only whereas MRRL considers reuse latencies for memory references originating from instructions in both the pre-sample and sample. Second, BLRL only considers reuse latencies that cross the pre-sample/sample boundary line; MRRL considers all reuse latencies. Third, in contrast to MRRL which uses the reuse latency to update the histogram, BLRL uses the pre-sample reuse latency. Previous work [11] has shown that BLRL substantially outperforms MRRL; the warmup length of BLRL is nearly half the warmup length of MRRL for the same level of accuracy.
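As an illustration of the BLRL computation, the Python sketch below builds the histogram of pre-sample reuse latencies and derives a warmup length for a given percentile. It rests on several assumptions that are not taken from the paper: the function name is hypothetical, references are given as (instruction index, address) pairs in program order, `boundary` is the index of the first sample instruction (N_B,P times L_B), `k_percent` lies in (0, 100], and the warmup length is rounded up to whole buckets so that every latency in the selected bucket is covered.

```python
from typing import Dict, Hashable, Iterable, Tuple

def blrl_warmup_length(refs: Iterable[Tuple[int, Hashable]], boundary: int,
                       bucket_size: int, k_percent: float) -> int:
    """Warmup length in instructions covering K% of cross-boundary reuses."""
    num_buckets = -(-boundary // bucket_size)  # ceiling division
    counters = [0] * num_buckets
    last_seen: Dict[Hashable, int] = {}
    for i, addr in refs:
        prev = last_seen.get(addr)
        # Only reuses from the pre-sample into the sample are counted.
        if i >= boundary and prev is not None and prev < boundary:
            x = i - prev                            # full reuse latency
            pre_sample_latency = x - (i - boundary)  # part before the boundary
            counters[min(pre_sample_latency // bucket_size,
                         num_buckets - 1)] += 1
        last_seen[addr] = i
    total = sum(counters)
    cumulative = 0
    for k, count in enumerate(counters):
        cumulative += count
        if total and cumulative * 100 >= k_percent * total:
            # Round up to whole buckets, never beyond the pre-sample.
            return min((k + 1) * bucket_size, boundary)
    return 0  # no reuse latency crosses the boundary line
```

Because `last_seen` is updated on every reference, only the first sample reference to an address can cross the boundary; later references to it have their previous reference inside the sample and are ignored.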
Hardware state checkpointing
Another approach to the cold-start problem is to checkpoint, i.e. store, the hardware state at the beginning of each sample and impose this state during sampled simulation. This approach yields perfectly warmed up hardware state. However, the storage needed for these checkpoints can explode in case many samples are required. In addition, the hardware state needs to be stored for each hardware configuration of interest. For example, for each cache and branch predictor configuration, a checkpoint needs to be made. Obviously, the latter constraint implies that the complete program execution needs to be simulated for these various hardware structures. Since this is infeasible to do in practice, researchers have proposed more efficient approaches to hardware state checkpointing. One example is the NSL approach [23, 24], which scans the pre-sample and records the latest reference to each unique memory location in the pre-sample. This is the stream of unique memory references as they occur in the memory reference stream, sorted by least recent use. In fact, NSL keeps track of all the memory references in the pre-sample and then retains the last occurrence of each unique memory reference. We will call the obtained stream the least recently used (LRU) stream. For example, the LRU stream of the reference stream 'ABAACDABA' is 'CDBA'. The LRU stream can be computed by building the LRU stack for the given reference stream. An LRU stack operates as follows: when address A from a reference stream is not present on the stack, it is pushed onto the stack. When, on the other hand, address A is present on the stack, it is removed from the stack and pushed onto the stack again. As such, it is easy to see that both reference streams, the original reference stream as well as the LRU stream, yield the same state when applied to an LRU stack.
The NSL warmup method exploits this property by computing the LRU stream of the pre-sample and applying this stream to the cache as warmup. Consequently, the NSL warmup strategy yields perfect warmup for caches with an LRU replacement policy.
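The LRU stack construction and the property that NSL relies on can be checked with a short Python sketch. The function names are hypothetical and the cache model is deliberately simplified to a single fully associative LRU cache holding `capacity` addresses (with `capacity` at least 1); it is meant only to illustrate the idea, not to reproduce the NSL tooling.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, List

def lru_stream(refs: Iterable[Hashable]) -> List[Hashable]:
    """Unique references ordered from least to most recently used."""
    stack: "OrderedDict[Hashable, None]" = OrderedDict()
    for addr in refs:
        stack.pop(addr, None)  # remove the address if already on the stack
        stack[addr] = None     # push it (again) on top
    return list(stack)

def lru_cache_contents(refs: Iterable[Hashable],
                       capacity: int) -> List[Hashable]:
    """Final contents (LRU first) of a fully associative LRU cache."""
    cache: "OrderedDict[Hashable, None]" = OrderedDict()
    for addr in refs:
        if addr in cache:
            cache.move_to_end(addr)        # hit: becomes most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # miss: evict least recently used
            cache[addr] = None
    return list(cache)

stream = lru_stream("ABAACDABA")
print("".join(stream))  # prints CDBA
# The full stream and its LRU stream leave the cache in the same state.
print(lru_cache_contents("ABAACDABA", 2) == lru_cache_contents(stream, 2))
```

The same comparison holds for any capacity in this simplified model, which is why replaying only the LRU stream suffices to warm an LRU cache.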
Barr et al. [25] extended this approach for reconstructing the cache and directory state during sampled multiprocessor simulation. In order to do so, they keep track of a timestamp per unique memory location that is referenced. In addition, they keep track of whether accessing the memory location originates from a load or a store operation. This information allows them to quickly build the cache and directory state at the beginning of each sample.
Van Biesbrouck et al. [12] proposed the MHS approach. Wenisch et al. [13] proposed a similar approach, called live-points, in TurboSMARTS; we will collectively refer to MHS and the TurboSMARTS' live-points approach as MHS. In MHS, the largest cache of interest is simulated once for each sample. The cache's content is then stored on disk. The content of smaller-sized caches can then be derived from the checkpoint. The disadvantage of this approach compared to NSL is that MHS requires the cache line size to be fixed. Whenever a cache needs to be simulated with a different cache line size, the warmup info needs to be recomputed. NSL does not have this disadvantage. The advantage of MHS over NSL, however, is that it is more space-efficient, i.e. less disk space is required for storing the warmup info. The reason is that NSL stores all unique pre-sample memory references; MHS, on the other hand, discards conflicting memory references from the warmup info for a given maximum cache size. A second advantage of MHS over NSL is that computing the MHS warmup info is faster than computing the NSL warmup info; NSL does an LRU stack simulation whereas MHS only simulates one particular cache configuration.
Comparing our NSL-BLRL approach against existing hardware state checkpointing techniques, we conclude that (i) NSL-BLRL is more space-efficient than NSL, i.e. requires less disk space than NSL, and (ii) NSL-BLRL is more broadly applicable during design space exploration than MHS and the TurboSMARTS' live-points approach because the NSL-BLRL warmup info is independent of the cache block size.
The major advantage of hardware state checkpointing is that it is an extremely efficient warmup strategy, especially in combination with checkpointed sampling. In practice, hardware state checkpointing basically trades the warm simulation phase, see Fig. 1 , for loading the hardware state checkpoint. And this is much more efficient in terms of simulation time than warm simulation. Using machine state checkpointing in combination with hardware state checkpointing leads to highly efficient simulation approaches that can simulate entire benchmarks in minutes [12 -14] . In addition, checkpointed sampling is the preferred method for parallel sampled simulation where the simulation of samples is distributed over a cluster of machines.
Note that efficient architectural checkpointing techniques can be implemented for cache hierarchies including TLB structures, and other cache-like structures such as the branch target buffer (BTB). However, architectural checkpointing techniques for taken/not-taken branch predictors that can be re-used over a number of branch predictors are not easy to do. Therefore, researchers have proposed a pragmatic approach by storing the contents of the branch predictors of interest as a checkpoint.
NSL-BLRL: COMBINING NSL AND BLRL
This paper proposes to combine the NSL warmup method with BLRL into NSL-BLRL. This is done by computing both the LRU stream and the BLRL warmup buckets corresponding to a given percentile K%. Only the unique references that are within the warmup buckets will be used to warm up the caches. This could be viewed as pruning the LRU stream with BLRL information. Conversely, this method could also be viewed as selecting the LRU stream from the BLRL warmup buckets. Note that computing the NSL-BLRL warmup info does not significantly increase the complexity of the warmup procedure. In fact, both can be integrated in a straightforward way. Computing the LRU stream, by construction, requires building and maintaining an LRU stack, and searching the LRU stack for the last reference to a given memory location can be done efficiently using a hash table; the hash table uses a memory address as its index and returns a pointer to an LRU stack entry. This same hash table can also be used to (simultaneously) identify the last reference in the dynamic instruction stream to that same memory location: next to returning a pointer to the LRU stack entry, the hash table then also returns the position in the dynamic instruction stream. The location of that last reference in the dynamic instruction stream compared to the current memory access then determines the BLRL distance. In other words, both warmup analyses, NSL and BLRL, can be integrated in an easy-to-implement way.
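The integration of the two analyses around one hash table can be sketched as follows. This Python example is a simplified illustration and not the authors' implementation: the function name and parameters are hypothetical, a Python dict (which preserves insertion order) stands in for both the hash table and the LRU stack, `boundary` is the index of the first sample instruction, `k_percent` is assumed to lie in (0, 100], and the warmup length is rounded up to whole buckets.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

def nsl_blrl_checkpoint(refs: Iterable[Tuple[int, Hashable]], boundary: int,
                        bucket_size: int, k_percent: float) -> List[Hashable]:
    """Truncated LRU stream (LRU first) for one pre-sample/sample pair."""
    # Hash table: address -> position of its last pre-sample reference.
    # Re-inserting on every reference keeps the keys in LRU order.
    last_pre: Dict[Hashable, int] = {}
    seen_in_sample: Set[Hashable] = set()
    num_buckets = -(-boundary // bucket_size)  # ceiling division
    counters = [0] * num_buckets
    for i, addr in refs:
        if i < boundary:                       # NSL part: maintain LRU stack
            last_pre.pop(addr, None)
            last_pre[addr] = i
        elif addr not in seen_in_sample:       # BLRL part: first sample touch
            seen_in_sample.add(addr)
            prev = last_pre.get(addr)
            if prev is not None:               # reuse crosses the boundary
                pre_sample_latency = boundary - prev
                counters[min(pre_sample_latency // bucket_size,
                             num_buckets - 1)] += 1
    total = sum(counters)
    warm_len, cumulative = 0, 0
    for k, count in enumerate(counters):
        cumulative += count
        if total and cumulative * 100 >= k_percent * total:
            warm_len = min((k + 1) * bucket_size, boundary)
            break
    warm_start = boundary - warm_len
    # Keep only unique references whose last use lies in the warmup buckets.
    return [addr for addr, pos in last_pre.items() if pos >= warm_start]
```

The returned list plays the role of the reduced LRU stream: replaying it in order against an LRU cache approximates the state that the corresponding BLRL warm simulation would have produced, while touching each unique address only once.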
Using NSL-BLRL as a warmup approach then works as follows in practice. The reduced LRU stream as it is obtained through NSL-BLRL is stored on disk as a hardware state checkpoint. Upon simulation of a sample, the reduced LRU stream is loaded from disk, the cache state is warmed up and finally the simulation of the sample gets started.
The advantage over NSL is that NSL-BLRL requires less disk space to store the warmup memory references; in addition, the smaller size of the reduced LRU stream results in faster warmup processing. The advantage over BLRL is that loading the reduced LRU stream from disk is more efficient than the warm simulation needed for BLRL. According to our results, the warmup length for BLRL is at least two orders of magnitude longer than for NSL-BLRL. As such, significant speedups are to be obtained compared to BLRL.
Note that NSL-BLRL inherits the limitation from NSL of only guaranteeing perfect warmup for caches with LRU replacement. Caches with other replacement policies such as random, first-in first-out (FIFO) and not-most-recently used are not guaranteed to get a perfectly warmed up cache state under NSL-BLRL (as is the case for NSL); however, the difference in warmed up hardware state is very small, as we show experimentally in Section 6.6.
EXPERIMENTAL SETUP
For the evaluation we use 9 SPEC CPU2000 integer benchmarks 1 (see Table 1). The binaries, which were compiled and optimized for the Alpha 21264 processor, were taken from the SimpleScalar website. 2 All measurements presented in this paper are obtained using the MRRL software, 3 which in turn is based on the SimpleScalar software [26]. The baseline processor simulation model is given in Table 2. The caches use write-allocate and write-back policies. We consider 50 samples (each containing 1M instructions). We select a sample every 100M instructions unless mentioned otherwise. These samples were taken from the beginning of the program execution to limit the simulation time while evaluating the various warmup strategies with varying percentiles K%. Taking samples deeper down the program execution would have been too time-consuming given the large fast-forwarding needed. However, we believe this does not affect the conclusions from this paper, since the warmup strategies that are evaluated in this paper can be applied to any collection of samples. Once a set of samples is provided, any of the warmup strategies can be applied to it. We quantify the performance of a warmup strategy using two metrics: accuracy and warmup length. The warmup length is defined as the number of instructions under warm simulation. The accuracy is quantified using the IPC prediction error, i.e. the percentage difference between the IPC for perfect warmup and the IPC for the warmup strategy of interest. A positive error means an IPC overestimation of the warmup approach compared to the perfect warmup case.
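One way to write the accuracy metric down explicitly is given below. The sign convention follows from the statement that a positive error means an IPC overestimation; normalizing by the perfect-warmup IPC is an assumption made here, since the text does not spell out the denominator.

```latex
\[
  \text{IPC prediction error}
  \;=\;
  \frac{\mathrm{IPC}_{\text{warmup}} - \mathrm{IPC}_{\text{perfect}}}
       {\mathrm{IPC}_{\text{perfect}}} \times 100\%
\]
```

As a purely hypothetical numerical illustration (these values are not measurements from this paper), a perfect-warmup IPC of 1.25 and a warmup-strategy IPC of 1.27 would give an error of (1.27 - 1.25) / 1.25 = +1.6%, i.e. an overestimation.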
RESULTS
In this section, we extensively evaluate our NSL-BLRL approach and compare it against NSL and BLRL. We use a number of criteria to evaluate our improved warmup proposal, namely accuracy, number of warm simulation instructions, overall simulation time and storage requirements.
Accuracy
Our first criterion to evaluate NSL-BLRL is its accuracy. Figure 4 shows the IPC prediction error for BLRL, NSL and NSL-BLRL for the various benchmarks and for varying percentiles K%. (Note that NSL yields the same accuracy as NSL-BLRL 100%.) The IPC prediction error is the relative error compared to a full warmup run, i.e. all instructions prior to the sample are simulated. In the IPC prediction errors that we present here, we assume that there is no stale state (no stitch) when warming up the hardware state before simulating a sample. This is to stress the warmup techniques; in addition, this is also the error one would observe under checkpointed parallel sampled simulation. A number of comments and observations need to be made here. As reported in previous work, BLRL results in a highly accurate warmup. BLRL yields small IPC prediction errors of only a few percent. Especially for large percentiles K%, the IPC prediction error due to incorrect hardware state is very small. For example, for BLRL 95%, the maximum error is only 1.6% (twolf). For BLRL 100%, the error is almost zero. Comparing NSL-BLRL versus BLRL for a given percentile K% typically gives slightly higher IPC prediction errors; however, the difference is very small (<1%). There are two reasons for these slightly higher IPC prediction errors. First, NSL-BLRL only warms the cache state, but does not warm branch predictor state. BLRL, on the other hand, warms both the cache hierarchy and branch predictor state. However, we found this influence to be very small. To experimentally verify this, we compared the accuracy of NSL-BLRL versus BLRL for perfect branch predictors (this was to exclude the branch predictor component in the warmup state) and we obtained very similar results to what is being reported here in Fig. 4. As such, we conclude that the impact of the branch predictor state is very small. The second reason for the difference between NSL-BLRL and BLRL is that while warming the caches through NSL-BLRL we do not keep track of dirty cache blocks, whereas BLRL does keep track of dirty cache blocks. Our results show that not warming dirty cache block info only has a small impact on overall accuracy. This is to be expected given the fact that contemporary out-of-order microprocessors give priority to load operations over writing back dirty data to upper layers of the memory hierarchy. However, if warming dirty cache blocks needs to be supported, extending our framework for supporting this would not be difficult. In summary, we can conclude that NSL-BLRL is a highly accurate cache warmup approach that is nearly as accurate as BLRL. In particular, high percentiles K% yield highly accurate performance estimates. The maximum error for K = 95% equals 1.4% (twolf); for K = 100%, the maximum error is even less, 0.66%.
Warmup length
We now compare the number of warm simulation instructions that need to be processed. Figure 5 shows the number of warm simulation instructions for BLRL as well as the number of warm simulation references for NSL-BLRL for different percentiles K%. Note that the vertical axis is on a logarithmic scale. We observe that NSL-BLRL yields a reduction in the number of warm simulation instructions by two to three orders of magnitude compared to BLRL. The reason for this dramatic reduction is that the number of warm simulation instructions for NSL-BLRL is proportional to the number of unique references in the pre-sample. BLRL, on the other hand, uses all references from a given warmup starting point up to the sample starting point. Note that these results were obtained for 100M instruction pre-samples prior to each sample. For larger pre-samples, the difference in the number of warm simulation instructions will increase even further when comparing BLRL versus NSL-BLRL.
Comparing now NSL-BLRL versus NSL, we also observe a substantial decrease in the number of warm simulation instructions. Figure 6 shows the number of warm simulation instructions of NSL-BLRL as a fraction of NSL. Some benchmarks do not benefit substantially from NSL-BLRL compared to NSL. However, we observe that NSL-BLRL 100% yields substantial warm simulation reductions for some benchmarks, up to 39% for bzip2; i.e. the warmup length for NSL-BLRL 100% is 61% of the NSL warmup length. For smaller K% percentiles, the reduction in warmup length increases significantly. Figures 7 and 8 show per-sample warmup lengths for bzip2 and parser, respectively. These graphs show that the reduction in warmup length depends on the given sample; some samples require a fairly large warmup under NSL-BLRL, whereas other samples only require a small fraction of the warmup length when comparing NSL-BLRL versus BLRL. The amount of reduction in warmup length depends on the temporal locality in the data address stream: in case of poor temporal locality, the reduction in warmup length will be limited; in case of good temporal locality, the reduction in warmup length will be substantial. Figure 9 shows the warmup length reduction factors through NSL-BLRL compared to BLRL as a function of the percentile K%. The interesting insight from this graph is that the warmup length reduction factor increases as we increase the percentile K%. As such, whereas increasing the percentile K% for BLRL is fairly costly in terms of total simulation time, increasing the percentile K% for NSL-BLRL is relatively cheap.
Again, all of these numbers are given for a 100M instruction pre-sample. For larger pre-sample sizes, the benefit of NSL-BLRL over NSL in terms of the number of warm simulation instructions increases further. This is illustrated in Fig. 10, where the number of warm simulation instructions is shown as a function of the pre-sample size for NSL and NSL-BLRL for bzip2 and parser. Similar curves were obtained for other benchmarks. The important trend to be observed from this graph is that the number of warm simulation instructions does not increase as fast for NSL-BLRL as it does for NSL. As such, we can conclude that NSL-BLRL scales better to larger pre-sample sizes and thus to longer running applications.
Simulation time
The number of warm simulation instructions only gives a rough idea of what the overall simulation time speedup would be for NSL-BLRL compared to BLRL. We identify two scenarios for sampled simulation, namely using fast-forwarding to navigate between samples and using checkpointing to restore machine state at the beginning of each sample. We first consider the case where fast-forwarding is used to go from one sample to the next. In this scenario, cold simulation is done through fast-forwarding. For BLRL, warm simulation starts when the warm simulation starting point is reached and continues until the beginning of the sample. Then, simulation switches to hot simulation. For NSL-BLRL, cold simulation is done until the beginning of the sample; then the NSL-BLRL checkpoint is loaded from disk and the hardware state is updated. Once the hardware state is updated, hot simulation of the sample starts. The results in Fig. 11 show the simulation time in seconds under fast-forwarding. We observe that BLRL achieves a substantial simulation time reduction compared to full warmup. NSL-BLRL reduces the overall simulation time even further, to a level where warmup using NSL-BLRL is nearly as fast as no warmup. In other words, the cost of warming up hardware state under fast-forwarding is nearly zero under NSL-BLRL. Note also that different percentiles K% do not affect overall simulation time. As such, we could conclude that a percentile K = 100% is the optimal choice since it gives the highest accuracy while incurring no additional simulation time overhead compared to smaller percentiles K%.
We now consider checkpointing instead of fast-forwarding for jumping between the various samples. Under checkpointed sampled simulation, there is no cold simulation. Simulating a sample starts by loading a machine state checkpoint from disk and initiating the warm simulation. Under BLRL, the warm simulation phase involves warming up caches while functionally simulating all instructions prior to the sample. Under NSL and NSL-BLRL, warm simulation involves loading a hardware state checkpoint. Once the hardware state is updated, hot simulation starts. Under checkpointed sampled simulation, we obtain the simulation time results presented in Fig. 12. BLRL yields substantial simulation time reductions over full warmup. Note that the simulation time reductions under checkpointing are even bigger than under fast-forwarding. This is to be expected, as checkpointed simulation does not require cold simulation, as opposed to fast-forwarding. Another interesting note is that the simulation time reduction of NSL-BLRL over BLRL is higher under checkpointing than under fast-forwarding. Under fast-forwarding, NSL-BLRL achieves a reduction in simulation time over BLRL of up to a factor 1.4X; under checkpointing, NSL-BLRL achieves a 2.9X to 14.9X simulation time speedup over BLRL. This is explained by the same reason: checkpointed simulation does not involve cold simulation.
Note that Fig. 11 and Fig. 12 should be compared with care. Comparing these two graphs clearly shows the simulation speedup of checkpointed sampling against fast-forwarded sampling, and this speedup is around one order of magnitude here. However, there are a number of pitfalls. First, these numbers were obtained from our experimental setup, where we used samples from the beginning of the program execution, each having a pre-sample of 100M instructions. In practice, however, in order to have a representative sample set, samples are likely to be chosen from the entire program execution. Fast-forwarding may become very costly in such a setup: if there is a sample deep down the execution stream, the whole program execution may end up being functionally simulated. Checkpointed sampling, on the other hand, will be very efficient. The simulation time under checkpointed sampling is then proportional to the number of samples, irrespective of whether these samples were taken deep down the execution stream. A second pitfall is that in these results we assume that all checkpointed samples get simulated on a single machine. Checkpointed sampling allows for simulating a number of samples in parallel on a cluster of machines. As such, even higher speedups are to be expected.
Storage
We now quantify the storage requirements of NSL-BLRL for storing the hardware state checkpoints on disk. Figure 13 shows the storage requirements for NSL-BLRL compared to NSL. (Note that BLRL does not require any significant storage.) The numbers shown in Fig. 13 represent the number of MBs of storage needed to store one hardware state checkpoint in compressed format; Fig. 14 compares compressed versus uncompressed hardware state checkpoints for NSL-BLRL and NSL. For NSL, the average compressed storage requirement per sample is 810 kB; the maximum observed is for bzip2, 2.5 MB. For NSL-BLRL, the storage requirements are greatly reduced compared to NSL. For example, for K = 100%, the average storage requirement is 553 kB (a 32% reduction); for K = 95%, the average storage requirement is 425 kB (a 48% reduction). As such, we conclude that the real benefit of NSL-BLRL compared to NSL is its reduced storage requirements. (Recall that NSL-BLRL and NSL are comparable in terms of accuracy and simulation time.) If a large number of checkpoints need to be stored on disk for a complete benchmark suite, we can easily end up with thousands of samples and corresponding checkpoint files. For example, for SimPoint there are 7392 1M instruction samples for the whole SPEC CPU2000 benchmark suite. If 810 kB needs to be stored on disk per sample, then approximately 6 GB of disk space is required for storing the NSL hardware state warmup info. Note that this is an optimistic approximation. In our experimental setup we assumed 100M instruction pre-samples. Larger pre-samples will result in even larger NSL warmup checkpoints to be stored on disk, as discussed previously (see also Fig. 10). As such, the total storage requirements are expected to be substantially larger than the 6 GB mentioned above. In addition, machine state checkpoints need to be stored on disk as well.
Even though disks are cheap these days, maintaining such large checkpoint files might be impractical. We conclude that NSL-BLRL is capable of reducing the total disk space requirements for hardware state checkpointing by approximately 30% without any loss in accuracy.
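The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (7392 samples, 810 kB per NSL checkpoint, 553 kB and 425 kB per NSL-BLRL checkpoint) and assumes decimal units, i.e. 1 GB = 10^6 kB; the variable names are illustrative.

```python
samples = 7392                       # SimPoint samples for SPEC CPU2000, as quoted in the text
nsl_kb = 810                         # average compressed NSL checkpoint size in kB
nsl_blrl_kb = {100: 553, 95: 425}    # average NSL-BLRL checkpoint size in kB per percentile K%

# Total NSL storage in GB, assuming 1 GB = 10^6 kB (decimal units).
total_gb = samples * nsl_kb / 1e6

# Relative reduction of NSL-BLRL over NSL for each percentile.
reductions = {k: 1 - kb / nsl_kb for k, kb in nsl_blrl_kb.items()}

print(f"NSL total: {total_gb:.2f} GB")
for k, r in reductions.items():
    print(f"K = {k}%: {100 * r:.1f}% smaller than NSL")
```

The computed total is just under 6 GB, and the two reductions round to the 32% and 48% stated above.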
NSL-BLRL versus NSL-MRRL
In our previous work, we combined NSL with MRRL [27]. In this paper, we combine NSL with BLRL for sampled processor simulation, whereas our previous work only considered cache simulation. In our analysis, we examined the performance of both NSL-BLRL and NSL-MRRL for sampled processor simulation. We observed that NSL-BLRL generally outperforms NSL-MRRL. This is to be expected given the fact that BLRL was shown to be both more accurate and more efficient than MRRL [11]. However, our results showed that the benefit of NSL-BLRL compared to NSL-MRRL is not as large as the benefit of BLRL over MRRL. For some benchmarks, the warmup length for NSL-BLRL is only a few percent shorter than the warmup length for NSL-MRRL for the same level of accuracy. For other benchmarks, however, we observed substantial reductions in warmup length for the same accuracy; see for example Fig. 15, where the IPC prediction error is shown as a function of the warmup length for gcc. We clearly observe that NSL-BLRL achieves a smaller IPC prediction error for the same warmup length or, conversely, a smaller warmup length for the same accuracy.
Cache replacement policies
Recall that NSL exploits the notion of the LRU cache replacement policy. As such, it is unclear whether NSL-BLRL is an accurate warmup technique for warming caches under different cache replacement policies. This is evaluated in Fig. 16 for the FIFO, the random and the LRU replacement policies. The graphs in Fig. 16 compare IPC obtained through sampled simulation with perfectly warmed-up caches versus IPC obtained through sampled simulation with NSL-BLRL cache warming. The IPC prediction error increases only slightly for the FIFO and random replacement policies compared to LRU. The average prediction error for LRU is 0.3%, whereas the average prediction errors for FIFO and random are 1.3% and 2.3%, respectively.
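Why a stream of unique references ordered by last use matches LRU but only approximates other policies can be seen on a toy example. The sketch below assumes a single fully associative cache with two entries, which is much simpler than the cache hierarchies evaluated in this paper; the helper names `nsl_stream` and `simulate` and the three-address trace are hypothetical.

```python
from collections import OrderedDict


def nsl_stream(refs):
    """Unique addresses of a trace, least recently used first."""
    order = OrderedDict()
    for addr in refs:
        order.pop(addr, None)
        order[addr] = True
    return list(order)


def simulate(refs, capacity, policy):
    """Return the contents of a fully associative cache after replaying refs."""
    cache = OrderedDict()  # front of the dict is the next victim
    for addr in refs:
        if addr in cache:
            if policy == "lru":
                cache.move_to_end(addr)  # a hit refreshes recency under LRU only
            continue
        if len(cache) == capacity:
            cache.popitem(last=False)    # evict the oldest (FIFO) or least recent (LRU) entry
        cache[addr] = True
    return set(cache)


trace = ["A", "B", "A", "C"]
warm = nsl_stream(trace)  # ["B", "A", "C"]

lru_full, lru_warm = simulate(trace, 2, "lru"), simulate(warm, 2, "lru")
fifo_full, fifo_warm = simulate(trace, 2, "fifo"), simulate(warm, 2, "fifo")

print(lru_full == lru_warm)    # identical contents under LRU
print(fifo_full == fifo_warm)  # contents differ under FIFO for this trace
```

Under LRU, replaying the reduced stream leaves the toy cache in the same state as replaying the full trace. Under FIFO the two states differ for this trace, which is the kind of deviation that would show up as the slightly larger errors reported for FIFO and random replacement.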
Statistical significance
So far, the evaluation in terms of accuracy for NSL-BLRL was limited to relative IPC prediction errors compared to full warmup and other warmup strategies; however, there was no discussion of how this translates into performance differences that may or may not be statistically significant. Haskins and Skadron [10] verified in a statistically rigorous manner that the null hypothesis stating that the warmed-up cache state under MRRL is significantly different from the perfectly warmed-up cache state can be rejected at the 95% confidence level. The statistical analysis was done using the matched-pairs t-test, which compares the IPC values pairwise, i.e. the IPC under perfect warmup and under MRRL is compared on a per-sample basis. In other words, based on this statistically rigorous analysis, they conclude that there is no statistically significant difference between MRRL and full warmup. Since NSL-BLRL and NSL-MRRL achieve the same or better accuracy for LRU caches than MRRL, as discussed in the previous section, we can thus conclude that NSL-BLRL yields simulation results that are statistically indistinguishable from those obtained with full warmup.
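For reference, the matched-pairs t statistic mentioned above can be computed as follows. This is a minimal sketch using only the Python standard library; the function name `paired_t_statistic` and the IPC values in the example are invented for illustration and are not measurements from this paper or from [10].

```python
import math
import statistics


def paired_t_statistic(ipc_a, ipc_b):
    """Return (t, degrees of freedom) for a matched-pairs t-test on per-sample IPC values."""
    if len(ipc_a) != len(ipc_b) or len(ipc_a) < 2:
        raise ValueError("need two equally long lists with at least two samples")
    diffs = [a - b for a, b in zip(ipc_a, ipc_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all per-sample differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-sample IPC values (illustrative only).
ipc_full_warmup = [1.20, 0.95, 1.48, 1.10, 0.87]
ipc_short_warmup = [1.21, 0.94, 1.47, 1.11, 0.87]

t, dof = paired_t_statistic(ipc_full_warmup, ipc_short_warmup)
print(t, dof)
```

The resulting |t| is then compared against the critical value of the t distribution for the given degrees of freedom (2.776 for 4 degrees of freedom at the two-sided 95% level); a smaller |t| means the per-sample differences are not statistically significant.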
CONCLUSIONS
Sampled simulation is an important tool for computer architecture research and development. The idea behind sampled simulation is to select a well-chosen number of samples from a complete program execution. There are two major issues related to sampled simulation: (i) the selection of representative samples and (ii) warming up the correct hardware state at the beginning of each sample, well known as the cold-start problem. This paper proposed to combine NSL with BLRL in a new warmup strategy called NSL-BLRL. The basic idea is to truncate the NSL stream of memory references in a pre-sample using BLRL information. The NSL stream is the LRU sequence of memory references in the pre-sample. BLRL then selects a fraction of this NSL stream based on how far back warmup needs to go in the pre-sample to accurately warm up the hardware state for the given sample. The NSL-BLRL warmup info can be viewed as a hardware state checkpoint. Warming up a cache hierarchy using NSL-BLRL is then done by loading the checkpoint from disk and warming the caches using the NSL-BLRL reference stream. Compared to other existing hardware state checkpointing techniques, NSL-BLRL is more flexible in the sense that the warmup info can be used for a broader range of hardware configurations. For example, whereas MHS and the TurboSMARTS live-points approach require a fixed cache block size, NSL-BLRL does not.
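The truncation step summarized above can be sketched roughly as follows. The selection rule used here is an assumption made for illustration, not the exact BLRL definition: for every address that the sample touches for the first time and that also occurs in the pre-sample, it measures how far before the sample start that address was last used, takes the K% percentile of those distances as the warmup starting point, and keeps only the stream entries whose last use lies at or after that point. All function names and the example traces are hypothetical.

```python
import math


def last_use_positions(pre_sample_refs):
    """Map each unique address to the index of its last use in the pre-sample."""
    return {addr: i for i, addr in enumerate(pre_sample_refs)}


def warmup_start(pre_sample_refs, sample_refs, k_percent):
    """Index in the pre-sample from which warmup must start to cover K% of boundary-crossing reuses."""
    last_use = last_use_positions(pre_sample_refs)
    n = len(pre_sample_refs)
    seen, distances = set(), []
    for addr in sample_refs:
        if addr in seen:
            continue  # later uses are reuses within the sample itself
        seen.add(addr)
        if addr in last_use:
            distances.append(n - last_use[addr])  # distance back from the sample start
    if not distances:
        return n  # no reuse crosses the boundary, so no warmup is needed
    distances.sort()
    rank = max(1, math.ceil(k_percent / 100 * len(distances)))
    return n - distances[rank - 1]


def truncated_stream(pre_sample_refs, sample_refs, k_percent):
    """Unique pre-sample addresses last used at or after the warmup start, least recent first."""
    start = warmup_start(pre_sample_refs, sample_refs, k_percent)
    last_use = last_use_positions(pre_sample_refs)
    kept = sorted((pos, addr) for addr, pos in last_use.items() if pos >= start)
    return [addr for _, addr in kept]


pre_sample = [1, 2, 3, 1, 4, 5, 4]
sample = [4, 1, 4, 6]

full = truncated_stream(pre_sample, sample, 100)
half = truncated_stream(pre_sample, sample, 50)
print(full, half)
```

In this toy case the untruncated stream of unique addresses has five entries, the K = 100% checkpoint keeps three and the K = 50% checkpoint keeps one, which mirrors the trade-off between checkpoint size and coverage discussed in the evaluation.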
Our experimental results using SPEC CPU2000 benchmarks show that NSL-BLRL is substantially faster than BLRL: the number of warmup instructions is reduced by up to three orders of magnitude. NSL-BLRL is nearly as accurate as BLRL. The small deviation is due to not modeling dirty cache lines in NSL-BLRL, but we found this difference to be very small. The shorter warmup length for NSL-BLRL results in substantial simulation speedups over BLRL. Under fast-forwarding, the simulation speedup is up to 1.4X. Under checkpointing, the simulation speedup varies between 2.9X and 14.9X. Compared to NSL, the real benefit of NSL-BLRL is in the smaller checkpoint files that need to be stored on disk. (In terms of accuracy and simulation time, NSL-BLRL is nearly as efficient as NSL.) NSL-BLRL typically yields 30% smaller hardware state checkpoint files, which is important when it comes to storing a large number of checkpoint files on disk for a large number of samples.
