Introduction
Parallelization is a powerful means of accelerating large or complicated computations, including discrete system simulation. Since Chandy and Misra published their classic paper [2] and Jefferson introduced the TimeWarp technique [7], a huge amount of research has been conducted to apply these parallel discrete event simulation (PDES) methods to various problems [4]. Architectural simulation of microprocessors, which we aim to accelerate, has however hardly benefited from these general-purpose means. This is mainly because a simulated microprocessor model can hardly be decomposed into loosely coupled parallel processes mapped onto, for example, PC cluster nodes. Although a microprocessor consists of many components working in parallel, e.g. pipeline stages, multiple function units, caches, branch predictors and so on, these components communicate with each other too frequently (e.g. every cycle) to hide interprocess communication costs, which can be larger than the cost of the operations performed inside the processes (e.g. just one addition).
Even for systems with multiple processors, PDES methods have not achieved remarkable success because, for example, the state of a processor is too large and complicated for the rollback mechanism of TimeWarp. Therefore, the few successful parallel simulators such as WWT [11] and Shaman [8] rely on other techniques: fast barrier synchronization in the former and a memory access filter in the latter. It should also be noted that the targets of these simulators are multiprocessors built from simple, in-order, single-pipeline processors rather than modern out-of-order superscalar processors.
On the other hand, if parallel performance and simulation accuracy may be traded off, a different and often efficient method is available, in which a sequential simulation process is split along the time axis for parallelization. This time-division parallelization may be considered to have its origin in classic trace sampling. For example, Heidelberger and Stone [6] proposed a cache simulator in which a memory access trace is split into intervals and subsets of them are distributed among parallel simulation nodes. The degradation of simulation accuracy caused by the cold start of each interval is partly reduced by a warmup phase at the beginning of the interval.
Nguyen et al. [10] then extended this idea to full trace analysis, in which intervals overlap each other to form warmup phases. For their PARSIM, they also devised a heuristic method that determines the length of a warmup phase using the L1 cache hit ratio as the measure of how well each simulator node is warmed. Recently, Girbal et al. [5] proposed a more sophisticated mechanism to adjust the warmup duration dynamically. In their DiST simulator, a simulation node is initially assigned a sequence of intervals that overlaps the sequences of its predecessor and successor at its head and tail. The node then compares statistics such as an ILP metric for its last interval with those for the first interval of its successor. The simulation of the sequence completes if the difference between the two metrics is smaller than a user-defined threshold; otherwise it extends to the next interval and further until the condition is satisfied.
The parallel efficiency of these simulators is high, especially when we accept a certain level of inaccuracy. We may expect linear speedup from the method of [6], but also the significant error inherent in trace sampling. PARSIM should also exhibit good efficiency when the L1 cache hit ratio is measured quickly and the resulting warmup phase is short enough, as reported in [10]. The sophistication of DiST degrades the efficiency to a certain extent, but it is still fast enough to exhibit up to 9-fold speedup, and more than 7-fold on average, with a 10-node cluster.
However, we have to remember that this good efficiency is obtained at a great cost in accuracy. Even the small single-digit-percentage error achieved by PARSIM and DiST could lead to a wrong conclusion about an innovative architecture [3]. More importantly, the amount of simulation error reported in the literature is obtained by experiments with a specific architecture and a specific set of workloads. That is, there is no theoretical way to predict or bound the amount of simulation error for one's own architecture and/or workload.
In contrast to these methods with inherent inaccuracy, this paper proposes a parallel time-division simulation method which assures perfect accuracy. More specifically, our parallel simulator always produces the same results as those obtained from its sequential basis, SimpleScalar [1]. As in other time-division simulators, a simulation process is split into intervals, which are distributed among nodes and simulated in parallel with approximate initial machine states. The essential difference from previous work is that an interval simulation may fail, namely when it produces a result different from what the correct initial state would yield. To assure perfect accuracy, the failure is recovered by the node responsible for the preceding interval, which therefore has the correct initial state of the failed interval. Thus, we are free from the accuracy issue and may concentrate on the efficiency issue, i.e. reducing the probability of interval failure.
The rest of the paper is organized as follows. Section 2 describes the principle of our parallelization more specifically. The key issue for efficiency is discussed in Section 3, which shows various techniques to reduce the failure rate. After a few implementation issues are presented in Section 4, performance numbers obtained from our experiments with a 16-node cluster are shown in Section 5. Finally, Section 6 concludes the paper and briefly shows the way to further performance improvement.
Principle of Parallelization
As shown in Figure 1 (a), a simulation process (gray fat arrow in the figure) is split into intervals $I_0, I_1, \ldots$, each of which consists of a fixed number $l$ of instructions to be executed. The intervals are distributed to $n$ simulation nodes $N_0, \ldots, N_{n-1}$ and simulated by them as follows. For the first interval group of $n$ intervals $I_0, \ldots, I_{n-1}$, the $k$-th interval $I_k$ is assigned to the node $N_k$. The head node $N_0$ performs fully cycle-accurate simulation, or full-simulation for short (black fat arrow), for the interval $I_0$ and the $w$ instructions following it, i.e. those at the beginning of the next interval $I_1$.
The following $n-2$ nodes first perform partial microarchitectural simulation, or partial-simulation for short (thin arrow), until they reach the beginning of their assigned intervals. That is, the node $N_k$ performs partial-simulation for $I_0, \ldots, I_{k-1}$ to obtain the architectural state and an approximate microarchitectural state, as discussed in Section 3. Then $N_k$ starts the full-simulation of $I_k$, but its first $w$ instructions (white bar part), which are also simulated by the predecessor $N_{k-1}$, are used for warmup. It also extends its full-simulation range by $w$ instructions as the overlapped warmup phase of its successor $N_{k+1}$. The tail node $N_{n-1}$ does almost the same job, but its additional $w$ instructions are simulated by that node alone.
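The division of work within the first interval group can be sketched as follows. This is a minimal illustration rather than the simulator's actual code: the function name `first_group_ranges`, the half-open instruction ranges and the label "measured" (the part of the full-simulation that follows the warmup) are assumptions made for this example, while $k$, $l$ and $w$ are as defined above.

```python
def first_group_ranges(k, l, w):
    """Instruction ranges [start, end) handled by node N_k in the first group.

    Hypothetical helper: l = instructions per interval, w = warmup length.
    """
    start = k * l                      # first instruction of interval I_k
    if k == 0:
        # The head node needs neither fast-forwarding nor warmup.
        partial, warmup, measured_start = (0, 0), (0, 0), 0
    else:
        partial = (0, start)           # partial-simulation of I_0 .. I_{k-1}
        warmup = (start, start + w)    # also fully simulated by N_{k-1}
        measured_start = start + w
    # Every node extends its full-simulation by w instructions into I_{k+1}.
    measured = (measured_start, start + l + w)
    return {"partial": partial, "warmup": warmup, "measured": measured}
```

With `l = 100` and `w = 10`, node 2 fast-forwards over instructions 0 to 199, warms up on 200 to 209, and its measured range starts exactly where that of node 1 ends, so the measured ranges tile the instruction stream without gaps.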
Each node, except for the head $N_0$, examines the validity of its full-simulation when it is finished. For the examination, $N_k$ gets the final microarchitectural state of $N_{k-1}$'s full-simulation and saves its own state at the end of its warmup phase as the approximate initial state. It is then checked whether both states yield the same result, as discussed in Section 3. If all the full-simulations pass the examination, as shown in the figure, we move to the second group of $n$ intervals. The head node, however, is now $N_{n-1}$ because it has the correct state at the $w$-th instruction in the interval $I_n$. Therefore, each other node $N_k$ is responsible for the interval $I_{n+k+1}$. In general, if the validity check is always passed, a node $N_k$ is assigned the set of intervals

$$\{\, I_{gn + ((k+g) \bmod n)} \mid g = 0, 1, 2, \ldots \,\}$$

where $g$ is the index of the interval group.

Figure 1. Parallel Simulation without Failure
Since the partial-simulation is expected to be faster than the full-simulation (10 to 30 times faster in our experiments), the serial-looking simulation process in Figure 1 (a) is in reality mapped onto the parallel one shown in (b). A rough estimate of the parallel speed-up is obtained as follows. Let $R$ be the ratio of partial-simulation speed to full-simulation speed, and let $r$ be the warmup length as a fraction of the interval length, i.e. $r = w/l$. Roughly speaking, each node repeats partial-simulation for $n-1-r$ intervals and full-simulation for $1+r$ intervals. Thus the speed-up $S$ should be given by the following equation.

$$S = \frac{n}{\dfrac{n-1-r}{R} + 1 + r} \tag{1}$$
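The estimate can also be written as a few lines of code. The sketch below assumes that time is measured in units of one fully simulated interval, so that a sequential run of one interval group costs $n$ units, while each node spends $(n-1-r)/R$ units on partial-simulation and $1+r$ units on full-simulation. The function name `estimated_speedup` is introduced here for illustration only.

```python
def estimated_speedup(n, R, r):
    """Rough speed-up of n nodes when no interval fails.

    n: number of nodes, R: partial/full simulation speed ratio,
    r: warmup fraction w / l.  Time is measured in units of one
    fully simulated interval.
    """
    sequential_time = n                          # n intervals, all full-simulated
    parallel_time = (n - 1 - r) / R + (1 + r)    # per node, per interval group
    return sequential_time / parallel_time


if __name__ == "__main__":
    # Tabulate the estimate for a few cluster sizes (R and r are sample values).
    for nodes in (4, 8, 16):
        s = estimated_speedup(nodes, R=10, r=0.01)
        print(f"n={nodes:2d}  speed-up={s:.2f}  efficiency={s / nodes:.0%}")
```

Because the partial-simulation term grows with $n$ while the full-simulation term does not, the efficiency in this model drops as nodes are added unless $R$ is increased.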
For example, if $n = 8$, $R = 10$ and $r = 0.01$ (1 % for warmup), the speed-up will be about 4.7, which means about 60 % parallel efficiency.
The description above assumes that all full-simulations are judged valid. This assumption, however, does not always hold: the full-simulation of an interval may fail due to an insufficient approximation of its initial state, as shown in Figure 2. In this example, the full-simulations by $N_0$ and $N_1$ are valid but that by $N_2$ is judged invalid. Note that the validity of the full-simulation by $N_3$ is not decidable because the final state of $N_2$ may be incorrect.
The failure must be recovered to preserve perfect accuracy by redoing the full-simulation of the failed interval. This recovery, for $I_2$ in our example, is done by $N_1$ because it has the correct initial state of the interval $I_2$. That is, if the full-simulation by $N_k$ fails, the work is delegated to its predecessor $N_{k \ominus 1}$, where $\ominus$ means modulo-$n$ subtraction, and the predecessor simply continues its full-simulation into the interval originally assigned to $N_k$ to recover from the failure. As for the failed node and its successors, $N_2$ and $N_3$ in our example, they also continue their full-simulations, exploiting their originally assigned intervals as long warmup phases. Since a long warmup phase greatly raises the similarity between the machine state at its end and the correct one, it is strongly expected that the full-simulation following it is valid. In general, if $N_k$ fails its full-simulation and the tail node is $N_t$ at the time of the failure, the work of the succeeding nodes $N_{k \oplus 1}, \ldots, N_t$ is delegated to their direct predecessors $N_k, \ldots, N_{t \ominus 1}$, where $\oplus$ means modulo-$n$ addition. The tail node $N_t$, which has no successor and was responsible for $I_j$ at the failure, simply continues its full-simulation into the next interval $I_{j+1}$.

Figure 2. Parallel Simulation with Failure
This simple failure recovery mechanism is sufficient to assure perfect accuracy and, thanks to the long warmup phase, almost bounds the number of failures to one per interval group. However, the performance loss due to a failure is severe because it halves the parallel efficiency of the simulation for an interval group. Thus, we have to reduce the probability of failure, as discussed in the next section.
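The interval rotation and the delegation rule of this section can be summarized in a short sketch. It is an illustration under stated assumptions, not the simulator's implementation: the closed form in `interval_index` is extrapolated from the first two interval groups described above (the tail of one group becomes the head of the next), and both function names and the dictionary-based result are hypothetical.

```python
def interval_index(k, g, n):
    """Interval assigned to node N_k in group g when no failure has occurred.

    Assumed closed form: the node order rotates by one position per group.
    """
    return g * n + (k + g) % n


def recover(order, intervals, failed_pos):
    """Work to be continued after the node at position failed_pos fails.

    order:     node ids from head to tail for the current group
    intervals: intervals[i] is the interval assigned to order[i]
    Returns {node id: interval it continues into}.
    """
    if failed_pos < 1:
        raise ValueError("the head node starts from the correct state")
    # The predecessor of the failed node redoes the failed interval; the
    # failed node and its successors each move on to the next interval,
    # using their original interval as a long warmup phase.
    return {order[pos]: intervals[pos] + 1
            for pos in range(failed_pos - 1, len(order))}
```

For the situation of Figure 2 with four nodes, `recover([0, 1, 2, 3], [0, 1, 2, 3], 2)` maps $N_1$ to $I_2$, $N_2$ to $I_3$ and $N_3$ to $I_4$, while $N_0$ has no additional work.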
Failure Rate Reduction
This section describes various techniques incorporated into our simulator to reduce the probability of interval simulation failure. We first briefly discuss classic warmup, then introduce two techniques for fast-forward partial-simulation, and finally explain how we examine the validity of the full-simulation of each interval.
Warmup
Warmup is a classic technique to approximate microarchitectural state and is based on the fact that hardware components cannot remember their states from the far past. For example, an instruction pipeline is a large and complicated state machine, but its state tends to be determined by a limited number of recent inputs, i.e. instructions and access results from caches, branch predictors, and so on. Moreover, the pipeline frequently forgets events in its near past due to branch misprediction. Since a misprediction flushes the instructions on the wrong path, a large gap opens between the instructions newly fetched from the correct path and those remaining in the pipeline, which makes the behavior of the former independent of the latter. Since a misprediction rate of about 1 % is usually inevitable and a conditional branch occurs every few tens of instructions, a warmup of a few thousand instructions should be sufficient to start an interval simulation from an arbitrary pipeline state, e.g. an empty one.
An element of a cache or a branch prediction table, e.g. a set of a cache, is also forgetful. It loses its memory after a fixed number of accesses with certain patterns. For example, a set of an $s$-way set-associative cache must lose its past after an access sequence with $s$ different tags. A two-bit saturating counter for branch prediction could have a long memory, but must forget it after a branch sequence in which the number of taken branches exceeds that of not-taken ones by three, or vice versa. However, this does not mean that a cache or a table as a whole is forgetful, because among its many elements there may be some that are accessed very infrequently. Therefore, even a significantly long warmup phase may not be sufficient if we initialize caches and tables with arbitrary values, e.g. fully invalidated states.
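The forgetting property of individual elements can be demonstrated with toy models. The sketch below assumes LRU replacement for the cache set (the text does not name a policy) and encodes the two-bit counter as an integer from 0 to 3; all function names are hypothetical.

```python
def lru_access(ways, tag):
    """Access `tag` in one cache set; `ways` lists tags, most recent first."""
    hit = tag in ways
    others = [t for t in ways if t != tag]
    return ([tag] + others)[:len(ways)], hit


def final_set_state(initial, tags):
    """State of one set after applying an access sequence to `initial`."""
    ways = list(initial)
    for tag in tags:
        ways, _ = lru_access(ways, tag)
    return ways


def counter_update(state, taken):
    """Two-bit saturating counter with states 0 (strongly not taken) .. 3."""
    return min(3, state + 1) if taken else max(0, state - 1)
```

For a 4-way set, applying the four distinct tags `a, c, d, b` to the initial states `[p, q, r, s]` and `[e, f, a, g]` leaves both sets in the same state, and three consecutive taken branches drive the counter to 3 from any starting value.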
Basic Partial-Simulation of Caches and Branch Predictors
The state of each element of caches and branch prediction tables is determined by the sequence of accesses to it. Although the exact sequence depends on out-of-order and speculative execution, a good approximation can be obtained from the instruction sequence executed in order. That is, an ordinary simulation of caches and branch predictors, e.g. sim-cache plus sim-bpred of SimpleScalar, should give us a good approximation together with the complete architectural state in registers and memory.
Thus, in our baseline simulator, partial-simulation covers caches, TLBs and branch predictors. A possible drawback of this method is that it degrades the performance of partial-simulation, while it should drastically reduce the failure rate. In fact, our partial-simulator is about three times slower than the simple ISA-level simulator sim-fast. However, this trade-off is justified by the fact that a failure halves the performance of an interval group, while the performance loss due to the three-fold slowdown of partial-simulation is estimated to be less than 30 % according to equation (1) in Section 2. Moreover, our technique to accelerate the partial-simulation, which will be discussed in Section 4, largely compensates for the degradation.
Partial-Simulation of Speculative Execution
The partial-simulation of caches and other components discussed above does not guarantee the correctness of their contents, mainly because it ignores speculative execution. Unfortunately, this neglect of speculation significantly affects the precision of the approximation of the instruction caches and TLB, as follows.
Suppose we have a conditional branch that is hardly ever taken, but the predictor mispredicts it as taken on its first occurrence. The target and subsequent instructions are then fetched, and are squashed afterward when the misprediction is recognized. However, these instructions along the mispredicted wrong path have been loaded into the hierarchy of instruction caches, and their page information into the instruction TLB. Since the taken condition is hardly ever satisfied, the predictor quickly learns that, and thus the wrong path will not be accessed for a long time. The instructions may nevertheless stay in the caches and TLB if these components have a capacity large enough to keep them even though they are not in the working set. Then, after a long duration spanning many intervals in our simulation, the branch is eventually taken for the first time in the architectural execution. Now we have a failure, because our partial-simulator has never fetched the instructions along the wrong path, while the real execution has done so and keeps them somewhere in the hierarchy of instruction caches and/or TLB.
Unfortunately, this scenario is not an artificial story but an everyday occurrence. In fact, our baseline simulator suffers from this mismatch so frequently that almost all failures are due to this scenario. Our partially-speculative version of the simulator therefore partially simulates the actions of mispredicted speculation. The additional simulation, kept simple so that its overhead is as light as possible, is as follows. In the partial-simulation, mispredictions are detected highly accurately through its approximate simulation of branch predictors, because the behavior of the predictors themselves is hardly affected by speculative or out-of-order execution. When a misprediction is detected, the instructions along the wrong path are fetched and filled into the caches until one of the following is satisfied: a constant number $m$ of instructions have been fetched, or a fetch misses the primary instruction cache. The second condition reflects the expectation that the misprediction will be recognized by the pipeline back-end while the front-end is handling the cache miss.
This simple mechanism is expected to reduce the failure rate by filling rarely accessed instructions along wrong paths into the caches, but it is of course not perfect. We therefore evaluated how effectively it reduces the failure rate and improves the total performance, as discussed in Section 5.
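The wrong-path fetch rule can be sketched as below. The sketch makes several simplifying assumptions that are not taken from the simulator: the primary instruction cache is modeled as an unbounded set of line numbers, the wrong path is treated as straight-line code, the line that misses is filled before fetching stops, and the line and instruction sizes are sample values. The name `fetch_wrong_path` is hypothetical.

```python
def fetch_wrong_path(l1i_lines, target_pc, m=12, line_size=32, insn_size=4):
    """Fetch along a mispredicted path during partial-simulation.

    l1i_lines: set of line numbers present in the (toy) primary I-cache.
    Stops after m instructions or at the first I-cache miss, whichever
    comes first.  Returns the number of instructions fetched.
    """
    fetched = 0
    pc = target_pc
    while fetched < m:
        line = pc // line_size
        hit = line in l1i_lines
        l1i_lines.add(line)        # assumed: the missing line is filled
        fetched += 1
        if not hit:
            break                  # back-end resolves the branch during the miss
        pc += insn_size
    return fetched
```

Starting at address 0 with only line 0 cached, the eight instructions of that line hit and the ninth fetch misses, so nine instructions are fetched and line 1 is brought into the cache.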
Examination of Interval Validity
If the full-simulation of an interval $I_k$ is performed with the correct initial microarchitectural state after the warmup, i.e. a state that perfectly matches the state at the end of $I_{k-1}$, the interval simulation is obviously valid. This is, however, a sufficient condition but not a necessary one. In fact, an element of a cache after $I_k$'s warmup, for example, may differ from that at the end of $I_{k-1}$ if the element is not accessed in $I_k$. Moreover, even if the element is accessed, the interval simulation may be valid if both the approximate and the correct values cause the same result, for example if both cause a cache miss and a line replacement in the same manner with respect to the effect (e.g. write-back) on the higher-level cache or memory. We call the approximate state of a microarchitectural component coherent with its correct state if both states give the same results for all accesses in the interval in question.
Intuitively, the coherence could be examined by preserving an access trace for each component and checking whether each access result recorded in the trace is the same as that obtained by applying the trace to the correct state. This naive implementation, however, obviously incurs a large overhead, not only temporal, due to the application of the trace to the correct state, but also spatial, for the large trace.
The key idea of the trace reduction is that we may not need traces after an element forgets its past before the end of the warmup, as shown in Figure 3. For example, for an element of an $s$-way set-associative cache, we need not trace accesses after we see accesses with $s$ different tags (e.g. a, c, d and b for set x in the figure), because after that the state of the element is independent of its initial state. Moreover, even before $s$ tags have been seen, an access with a tag need not be traced if another access with the same tag has already been traced (e.g. the second a/x in the figure), because it hits in both caches. Therefore, we need only $s$ accesses for each element, and thus the total trace size is $O(C)$ for a cache of capacity $C$, independent of the number of accesses to the cache in the interval.
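A reduced trace and the corresponding coherence check might look as follows. This is a sketch under assumptions: LRU replacement, hit/miss as the only recorded access result (effects such as write-backs are ignored), and hypothetical names `ReducedTrace` and `coherent`.

```python
from collections import defaultdict


class ReducedTrace:
    """Keeps at most `ways` distinct tags per cache set."""

    def __init__(self, ways):
        self.ways = ways
        self.entries = defaultdict(list)   # set index -> [(tag, hit), ...]

    def record(self, set_index, tag, hit):
        """Record an access of the approximate run; return True if it was kept."""
        seen = self.entries[set_index]
        if len(seen) >= self.ways or any(t == tag for t, _ in seen):
            return False                   # set has forgotten, or tag repeats
        seen.append((tag, hit))
        return True


def coherent(trace, correct_sets):
    """Replay the reduced trace on the correct initial state (MRU first)."""
    for set_index, seen in trace.entries.items():
        ways = list(correct_sets[set_index])
        for tag, hit in seen:
            if (tag in ways) != hit:
                return False               # results differ: interval fails
            ways = ([tag] + [t for t in ways if t != tag])[:trace.ways]
    return True
```

At most `ways` entries are stored per set regardless of how many accesses the interval contains, which is where the $O(C)$ bound comes from.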
The trace is also useful to build the correct state of caches and other components after an interval is judged valid. For example, since a set of an $s$-way set-associative cache may not be fully accessed in the interval, its final state may have unaccessed ways that contain potentially incorrect tags. The existence of such ways is detected if the number of accesses in the trace for the set is less than $s$, and the correct values are found in the correct initial state. In the example shown in Figure 3, the final state of the set y has an unaccessed and incorrect tag e. This tag is replaced with the correct tag d, which was the most recent one among the unaccessed tags in the correct initial state.
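The repair of a partially accessed set can be sketched as below, again assuming LRU ordering with the most recently used tag first. The function name `repair_set` and the sample tags are illustrative and do not reproduce the contents of Figure 3.

```python
def repair_set(accessed_mru, correct_initial_mru, ways):
    """Rebuild the correct final state of a set that was not fully accessed.

    accessed_mru:        tags accessed in the interval, most recent first
    correct_initial_mru: correct initial state of the set, most recent first
    """
    # Tags of the correct initial state that were not touched in the interval
    # survive in their original recency order behind the accessed tags.
    survivors = [t for t in correct_initial_mru if t not in accessed_mru]
    return (list(accessed_mru) + survivors)[:ways]
```

For a 4-way set in which only `a`, `b` and `c` were accessed and whose correct initial state was `[c, d, f, g]`, the remaining way receives `d`, the most recent of the unaccessed correct tags.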
Implementation
Our implementation is based on SimpleScalar 3.0, whose sim-outorder is the basis of our full-simulator while sim-cache and sim-bpred are the bases of our partial-simulator for the approximation. We chose SimpleScalar simply for its popularity and the availability of its source code. That is, our parallelization method is applicable to any cycle-accurate simulator with reasonably designed representations of the microarchitectural components of its target machine.
We also applied our acceleration technique for in-order simulation [9] to our fast-forwarding partial-simulator, yielding the fast-forward-boosted version. This technique is similar to binary translation and has comparable efficiency, but is much more portable because it produces C source code to simulate the workload binary. That is, each basic block in the workload binary is translated into a C function that simulates the architectural operations, the cache and TLB accesses, and the lookups and modifications of branch prediction tables performed by the instructions in the block. This translation mechanism is so helpful for our application that we needed only a few days to build the accelerated version with the partially-speculative execution discussed in Section 3.3.
As discussed in Section 3.4, we have two methods to examine the validity of an interval for each hardware component. One is to check the equivalence of the correct and approximate states, and it must be applied to the instruction pipeline. The other, checking coherence with an access trace, is preferable for caches because of their large capacity and sensitivity to speculative execution. As for the other components, since the choice depends on their nature, we investigated the pass ratio of the first, simpler examination with a few example workloads. Our choice is shown in Table 1, which also lists the components in the default configuration of SimpleScalar, used in our experiments, together with their quantitative parameters. As shown in the table, we chose the equivalence check for branch prediction tables because their contents are hardly polluted by speculative execution. The return address stack, however, may have stale addresses pushed by speculative branch-and-links below its bottom because it is implemented as a cyclic buffer. Thus we apply the tracing technique to the elements popped but not pushed in the interval.
Another important implementation issue is to determine three parameters: the number of instructions in an interval, $l$; the number for warmup, $w$; and the maximum number of instructions fetched on a misprediction, $m$, in the partial-simulation of the partially-speculative version. For the last two parameters $w$ and $m$, we made a few investigations to obtain a nearly optimal setting of $w = 1 \times 10^6$ and $m = 12$. On the other hand, we need more experiments to tune the interval length $l$ because its optimal value should depend on the characteristics of each workload and the number of simulation nodes. In general, the longer an interval is, the smaller the communication, synchronization and warmup overheads are, but the larger the penalty paid for a failure is. Therefore, we tried two settings, $l = 20 \times 10^6$ (denoted as 20M) and $l = 100 \times 10^6$ (100M), in our experiments discussed in Section 5.
Experiments
We evaluated the performance of three versions of our simulator: baseline, partially-speculative and fast-forward-boosted. The workloads are SPEC CPU95 benchmarks with the "train" dataset, but we omitted "swim", "applu" and "fpppp" from the FP members and used only "vortex" among the INT ones, because the numbers of instructions executed by the omitted benchmarks, at most $1.46 \times 10^9$ for "ijpeg", are less than the $100 \times 10^6 \times 16 = 1.6 \times 10^9$ needed for 16-node parallel simulation with the larger interval setting $l$ = 100M.
The environment for the experiments is our 16-node cluster of 2.8 GHz Xeon processors with 1 GB of memory each. The nodes are connected by 1 Gbps Ethernet links through a switch with wire-speed switching capability. The simulator programs are compiled by gcc 3.3.2 with the "-O2" option, linked with MPICH 1.2.7, and run under Vine Linux 3.1 with kernel 2.4.31.
The two graphs in Figure 4 show the speedup over the original version of SimpleScalar's sim-outorder and the failure rate of our baseline simulator. In these and the following graphs, the four bars correspond, from left to right, to $l$ = 20M and 100M with 8-node execution ($n = 8$) and to those with 16-node execution ($n = 16$).
The resulting 8-node speedups, up to 4.7 (apsi, 20M) and 3.6 on average with the 20M setting, are not very satisfactory. The 16-node results, up to 6.1 (apsi, 20M) and 4.8 on average with 20M, are also discouraging. These relatively poor parallel performance numbers are due to the high failure rate shown in the figure. That is, since $R$ in equation (1) of Section 2, the ratio of partial- to full-simulation speed, is 10.4 to 12.2 and 11.3 on average, we may expect speedups of 4.8-4.9 and 6.7-6.8 in the 8-node and 16-node cases. Although the highest numbers are only slightly less than expected, the others and the average are significantly lower due to failure rates of 10 % or more. This degradation by failures is notable in the 100M cases of "turb3d", "apsi" and "wave5" due to the large failure penalty with the longer intervals.
These discouraging results are significantly improved by the partially-speculative version, as shown in Figure 5. This version greatly reduces the failure rate, by up to 95 % in the 8-node 100M case of "turb3d", whose 8-node and 16-node speedups are boosted from 3.5 and 4.5 in the baseline to 4.5 and 6.4 respectively. The performance with the 100M setting is also improved in "mgrid", "apsi" and "wave5", because a failure affects the performance of the 100M case more severely than that of the 20M case. In total, a failure rate reduction of about 30 % improves the average performance of the 100M case by 12 %, while that of the 20M case grows by 6 %.
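The expected values quoted above can be checked by hand. The worked figures below assume that the speed-up estimate of Section 2 takes the form $S = n / ((n-1-r)/R + 1 + r)$, which reproduces the 4.7 example given there, and use the average $R = 11.3$ together with $r = w/l = 0.05$ for the 20M setting and $r = 0.01$ for the 100M setting.

$$
\begin{aligned}
n = 8,\ r = 0.05 &: \quad S = \frac{8}{6.95/11.3 + 1.05} \approx 4.80 \\
n = 8,\ r = 0.01 &: \quad S = \frac{8}{6.99/11.3 + 1.01} \approx 4.91 \\
n = 16,\ r = 0.05 &: \quad S = \frac{16}{14.95/11.3 + 1.05} \approx 6.74 \\
n = 16,\ r = 0.01 &: \quad S = \frac{16}{14.99/11.3 + 1.01} \approx 6.85
\end{aligned}
$$

These values agree with the quoted ranges of 4.8-4.9 and 6.7-6.8 for failure-free execution.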
The performance, however, is not always improved, and a few benchmarks with the 100M setting suffer a small speedup loss of about 1 % because their failure rates are not reduced while the partial-simulation speed degrades by about 10 %. Thus we applied the fast-forward-boosting technique to compensate for the degradation. The result shown in Figure 6 is much better than those in Figures 4 and 5. In fact, we now have up to 5.8-fold and 9.4-fold speedup (turb3d) with the 8-node and 16-node configurations respectively, thanks to the 3-fold acceleration of partial-simulation. The average numbers are also much better than the previous ones, with about 15 % and 25 % improvement in the 8-node and 16-node configurations, and the 8-node numbers are now at
The 16-node average speedup, 6.4-fold, still needs improvement. For example, "tomcatv" shows about 6-fold speedup, slightly less than the average, while its failure rate is less than 2 %, much lower than the average. This is mainly due to load imbalance and the synchronization overhead exposed by the imbalance. That is, a node has a slightly smaller load than its successor in its partial-simulation, as shown in Figure 1 (b) in Section 2. The node, however, has to wait for the completion of the interval simulation by its successor because it may become responsible for that interval in the case of failure. This idle time may be removed by tuning the lengths of partial- and full-simulation for each node dynamically, rather than fixing them as our current implementation does. This improvement and further failure rate reduction are left as future work.
Conclusions
We proposed, implemented and evaluated a parallel cycle-accurate microarchitectural simulator. The time-division parallelization of the simulator is similar to previous proposals that trade off parallel performance and accuracy, but is essentially different from them because our simulator assures an accurate result equivalent to that obtained from the sequential simulator SimpleScalar. For this assurance, a simulation interval is redone if its result is invalid due to an insufficient approximation of its initial microarchitectural state. In order to reduce the probability of this interval failure, our simulator performs partial simulation of the microarchitecture and state coherence examination based on access tracing, in addition to classic warmup.
By these techniques and our source-code translation technique to accelerate the partial-simulation, we achieved
