Abstract
Microarchitecture research and development rely heavily on simulators. The ideal simulator should be simple and easy to develop, and it should be precise, accurate and very fast. But the ideal simulator does not exist, and microarchitects use different sorts of simulators at different stages of the development of a processor, depending on which is most important, accuracy or simulation speed. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is a possible way to trade accuracy for simulation speed in situations where the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box emitting requests to the uncore at certain times. A behavioral core model can be connected to a detailed uncore model. Behavioral core models are built from detailed simulations. Once the time to build the model is amortized, important simulation speedups can be obtained. We describe and study a new method for defining behavioral models for modern superscalar cores. [...] less than 10 % in most cases. We show that BADCO is qualitatively accurate, being able to predict how performance changes when we change the uncore. The simulation speedups we obtained are typically between one and two orders of magnitude.
To the best of our knowledge, the work by Lee et al. is the only previous study that has focused specifically on behavioral superscalar core modeling [16] . They found that behavioral core models could bring important simulation speedups with a reasonably good accuracy. However the detailed simulator that they used, SimpleScalar sim-outorder [1] , does not model precisely all the mechanisms of a modern superscalar processor. We present in this paper an evaluation of Lee et al.'s Pairwise Dependent Cache Miss (PDCM) core modeling method using the Zesto detailed simulator, a detailed model of a modern superscalar microarchitecture [18] . We implemented a core model based on the PDCM approach with a reasonably good accuracy. Still, we identified some opportunities to improve the accuracy.
This has led us to propose a new method for behavioral application-dependent superscalar core modeling, BADCO, inspired by but different from PDCM. Unlike PDCM, which uses a single detailed simulation to build the core model, BADCO uses two detailed simulations. The first detailed simulation, identical to the one performed for PDCM, provides timing information for µops when all level-1 (L1) miss requests have a null latency. For the second simulation, we force a long latency on all L1 miss requests. Unlike PDCM, which uses a structural approach to find the dependences between requests, BADCO infers dependences from the timing information provided by the second detailed simulation.
The accuracy of BADCO is on average better than that of PDCM on all the configurations we have tested. We have studied not only the ability of BADCO to predict raw performance but also its ability to predict how performance changes when we change the uncore. Our experiments demonstrate a good qualitative accuracy of BADCO, which is important for design space exploration. The simulation speedups we obtained for PDCM and BADCO are in the same ranges, typically between one and two orders of magnitude.
This paper is organized as follows. In Sect. 2, we discuss previous work on core modeling. Section 3 illustrates the limits of approximate core modeling. Section 4 briefly describes PDCM and how we adapted it for the Zesto simulator. We describe the proposed BADCO modeling method in Sect. 5. Section 6 presents an experimental evaluation of the accuracy and simulation speed of PDCM and BADCO for single core applications. Section 7 evaluates BADCO's accuracy and simulation speed for multiprogram workloads.
Previous Work on Superscalar Core Modeling
Trace-driven simulation is a classical way to implement approximate processor models. Trace-driven simulation does not model exactly (and very often ignores) the impact of instructions fetched on mispredicted paths and it cannot simulate certain data mispeculation effects. The primary goal of these approximations is not to speed up simulations but to decrease the simulator development time. A trace-driven simulator can be more or less detailed: the more detailed, the slower. We focus here on modeling techniques that can be used to implement a core model and that can potentially bring important simulation speedups.
Structural Core Models
Structural models speed up superscalar processor simulation by modeling only "first order" parameters, i.e., the parameters that are supposed to have the greatest performance impact in general. Structural models can be more or less accurate depending on how many parameters are modeled. Hence there is a tradeoff between accuracy and simulation speedup.
Loh described a time-stamping method [19] that processes dynamic instructions one by one instead of simulating cycle by cycle as in detailed performance models. A form of time-stamping had already been implemented in the DirectRSIM multiprocessor simulator [4, 26] . Loh's time-stamping method uses scoreboards to model the impact of certain limited resources (e.g., ALUs). The main approximation is that the execution time for an instruction depends only on instructions preceding it in sequential order. This assumption is generally not exact in modern processors.
Fields et al. used a dependence graph model of superscalar processor performance to analyze quickly the microarchitecture performance bottlenecks [7] . Each node in the graph represents a dynamic instruction in a particular state, e.g., the fact that the instruction is ready to execute. Directed edges between nodes represent dependences, e.g., the fact that an instruction cannot enter the reorder buffer (ROB) until the instruction that is ROB-size instructions ahead is retired.
Karkhanis and Smith described a "first-order" performance model [14] , which was later refined [2, 5, 6] . Instructions are (quickly) processed one by one to obtain certain statistics, like the CPI (average number of cycles per instruction) in the absence of miss events, the number of branch mispredictions, the number of non-overlapped long data cache misses, and so on. Eventually, these statistics are combined in a simple mathematical formula that gives an approximate global CPI. The model assumes that limited resources, like the issue width, either are large enough to not impact performance or are completely saturated (in a balanced microarchitecture, this assumption is generally not true [22] ). Nevertheless, this model provides interesting insights. Recently, a method called interval simulation was introduced for building core models based on the first-order performance model [9, 24] . Interval simulation permits building a core model relatively quickly from scratch.
Another recently proposed structural core model, called In-N-Out, achieves simulation speedup by simulating only first-order parameters, like interval simulation, but also by storing in a trace some preprocessed microarchitecture-independent information (e.g., longest dependence chains lengths), considering that the time to generate the trace is paid only once and is amortized over several simulations [15] .
Behavioral Core Models
Kanaujia et al. proposed a behavioral core model for accelerating the simulation of multicore processors running homogeneous multi-programmed workloads [13]: one core is simulated with a detailed model, and the other cores mimic the detailed core approximately.
Li et al. used a behavioral core model to study multicores running heterogeneous multi-programmed workloads [17]. Their behavioral model simulates not only performance but also power consumption and temperature. The core model consists of a trace of level-2 (L2) cache requests annotated with access times and power values. This per-application trace is generated from a detailed simulation of a given application, in isolation and assuming a fixed L2 cache size. Then, this trace is used for fast multicore simulations. The model is not accurate because the recorded access times are different from the real ones. Therefore the authors do several multicore simulations to refine the model progressively, the L2 access times for the next simulation being corrected based on statistics from the previous simulation. In the context of their study, the authors found that 3 multicore simulations were enough to reach a good accuracy.
The ASPEN behavioral core model was briefly described by Moses et al. [20] . This model consists of a trace containing load and store misses annotated with timestamps [20] . Based on the timestamps, they determine whether a memory access is blocking or non-blocking.
Lee et al. proposed and studied several behavioral core models [3, 16] . These models consist of a trace of L2 accesses annotated with some information, in particular timestamps, like in the ASPEN model. They studied different modeling options and found that, for accuracy, it is important to consider memory-level parallelism. Their most accurate model, Pairwise Dependent Cache Miss (PDCM), simulates the effect of the reorder buffer and takes into account dependences between L2 accesses. We describe in Sect. 4 our implementation of PDCM for the Zesto microarchitecture model.
Behavioral Core Models for Multi-Core Simulation
Behavioral core models can be used to investigate various questions concerning the execution of workloads consisting of multiple independent tasks [17, 27] . Once behavioral models have been built for a set of independent tasks, they can be easily combined to simulate a multi-core running several tasks simultaneously. This is particularly interesting for studying a large number of multiprogram workloads, as the time spent building each model is largely amortized.
Simulating accurately the behavior of parallel programs is more difficult. Trace-driven simulation (functional-first simulation in general) cannot simulate accurately the behavior of non-deterministic parallel programs for which the sequence of instructions executed by a thread may be strongly dependent on the timing of requests to the uncore [10]. Some previous studies have shown that trace-driven simulation could reproduce somewhat accurately the behavior of certain parallel programs [9, 10], and it may be possible to implement behavioral core models for such programs [3, 23]. Nevertheless, behavioral core modeling may not be the most appropriate simulation tool for studying the execution of parallel programs.
The Limits of Approximate Microarchitecture Modeling
The curves in Fig. 1 were obtained with the Zesto simulator [18] and show the normalized execution time as a function of the L1 miss latency, assuming that the miss latency is uniform and constant. One would expect these curves to be monotonically increasing and convex (see the Appendix): as the miss latency is increased, there should be more and more misses on the critical path (the chain of dependent events that determines the overall execution time [8]). The curve for h264ref is nearly convex, as are the curves for a majority of our benchmarks. However, some benchmarks like libquantum have a clearly non-convex curve. This shows that the critical path, though a convenient conceptual tool, does not reflect completely what happens in an OoO microarchitecture. This illustrates the inherent difficulty of defining approximate microarchitecture performance models. The behavior of an OoO core depends on many mechanisms interacting in a complex way and impacting performance. Such complex behavior is difficult to reproduce with a simplified model, be it structural or behavioral. With this limitation in mind, the aim of approximate microarchitecture modeling is to find a good trade-off between simulation accuracy and simulation speed.
Lee et al. present in [16] three different behavioral models: isolated cache miss, independent cache miss, and pairwise dependent cache miss. The models use traces of L2 accesses annotated with additional information to approximate the behavior of an OoO core. The isolated cache miss model is a pessimistic approach in which all trace items are processed sequentially, which leads to an overestimation of the execution time. The independent cache miss model uses the ROB to control the number of items that can access memory. This model assumes total independence among trace items. As a result, it substantially underestimates the cycle count for benchmarks with many dependencies between memory accesses.
Finally, the pairwise dependent cache miss model (PDCM) improves over the independent cache miss model by considering dependencies between items. PDCM is accurate for an idealistic OoO core with perfect branch prediction and no hardware prefetch. In this section, we study the PDCM model and how it performs with a realistic OoO core. We also propose some improvements to increase PDCM's accuracy.
PDCM Simulation Flow
The simulation flow of the PDCM behavioral core model has two phases: model building and trace simulation. Figure 2 presents the simulation flow for PDCM. During the model building phase, a per-application trace is generated from a detailed microarchitecture simulator, assuming an ideal L2 cache, i.e., forcing an L2 cache hit on each L1 cache miss. Each trace item represents an instruction with an L1 miss. The trace item information contains (1) the request type (read, write, instruction, etc.); (2) the instruction delta, i.e., the number of instructions between this L1 miss and the next L1 miss; (3) the time delta, i.e., the number of cycles elapsed between this L1 miss and the next L1 miss; and (4) a data dependence, i.e., on which previous L1 miss this L1 miss depends, directly or indirectly. This data dependence is found by analyzing register and memory dependences during trace generation, taking into account the indirect dependences caused by delayed L1 hits. During the trace-driven simulation, the time deltas and the dependences are used to compute the issue time of each L1 miss. Dependences include both data dependences and structural dependences induced by the limited reorder buffer (ROB). In particular, the instruction deltas are used to simulate the effect of the limited ROB and determine whether or not independent L1 misses can overlap.
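The four trace item fields and their use during trace-driven simulation can be illustrated with a minimal sketch. It is not the PDCM implementation: the names `TraceItem` and `issue_times`, the uniform extra miss latency, and the exact ROB rule (a miss cannot issue while an uncompleted earlier miss is `rob_size` or more instructions older) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceItem:
    request_type: str          # (1) "read", "write", "instruction", ...
    instr_delta: int           # (2) instructions until the next L1 miss
    time_delta: int            # (3) cycles until the next L1 miss (ideal L2)
    depends_on: Optional[int]  # (4) index of the earlier miss it depends on


def issue_times(trace: List[TraceItem], miss_latency: int, rob_size: int) -> List[int]:
    """Compute an issue time per L1 miss (hypothetical, simplified rule)."""
    issue: List[int] = []
    done: List[int] = []      # completion time of each miss
    position: List[int] = []  # instruction index of each miss
    pos = 0
    for k, item in enumerate(trace):
        # Start from the spacing observed with the ideal L2.
        t = 0 if k == 0 else issue[k - 1] + trace[k - 1].time_delta
        # Data dependence: wait for the miss this one depends on.
        if item.depends_on is not None:
            t = max(t, done[item.depends_on])
        # ROB: an earlier miss that is too far behind must complete first.
        for j in range(k):
            if pos - position[j] >= rob_size:
                t = max(t, done[j])
        issue.append(t)
        done.append(t + miss_latency)
        position.append(pos)
        pos += item.instr_delta
    return issue


independent = [TraceItem("read", 10, 5, None) for _ in range(3)]
print(issue_times(independent, miss_latency=100, rob_size=128))
print(issue_times(independent, miss_latency=100, rob_size=16))
```

With a large ROB the three independent misses overlap; with a 16-entry ROB the third miss is 20 instructions after the first and must wait for it to complete.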
Adapting PDCM for Realistic OoO Core
The original PDCM was tested with the SimpleScalar sim-outorder microarchitecture model assuming 100 % correct branch predictions [16]. Zesto is more detailed than sim-outorder, and we had to spend substantial effort adapting PDCM for Zesto in order to bring the accuracy to a level similar to that reported in [16]. Figure 3 illustrates our efforts. The first bar (leftmost) shows the accuracy obtained with our initial implementation of PDCM, based on what is explicitly described in the original PDCM paper, taking into account the limited MSHRs and assuming a perfect branch prediction. The second bar shows the impact of having a realistic branch predictor and activating hardware prefetchers: unsurprisingly, the accuracy degrades. Then we improved the accuracy, keeping the general principles of the PDCM approach: we have introduced in the model TLB misses (third bar), write backs (fourth bar), wrong-path L1 misses (fifth bar), L1 prefetch requests (sixth bar), and more precise modeling of delayed hits (last bar). The numbers shown for PDCM in the remainder of this study were obtained with our optimized version. Here, we present a detailed description of our improvements to PDCM.
TLB Misses and Inter-Request Dependencies
The original PDCM considers only three types of L1 misses: instruction, load and store misses. Computer systems that support virtual memory use translation lookaside buffers (TLBs) to translate virtual addresses into physical addresses. TLBs are small caches whose only function is to help in the translation of virtual addresses. Modern processors include at least two TLBs: one for instructions and one for data. TLB misses introduce an extra delay on memory accesses. Additionally, if a memory access generates both a TLB miss and an L1 cache miss, then the TLB miss must be resolved before the L1 miss starts being processed. A similar dependency exists between instruction misses and data misses. It occurs when the instruction misses at fetch, and then during execution the same instruction generates a memory access that also misses. An extreme case happens when the same instruction misses during fetch in both the instruction TLB and the instruction L1 cache, and misses again in both the data TLB and the data L1 cache. In this case, the latencies of the four misses are serialized. A PDCM model must be able to reproduce this behavior during trace simulation. Hence, we have extended trace items to contain multiple requests, which may have sequential dependencies among them. During trace simulation, we reproduce the dependencies between requests associated with the same trace item.
Write-Backs
Another type of memory request not considered originally by PDCM is the write-back (WB). A WB occurs when a dirty cache block is selected as the victim to be replaced by another incoming block. The evicted block is inserted into the WB buffer and waits there until it can be written to the next level in the memory hierarchy. A WB generally does not have an immediate impact on performance, but it consumes bandwidth and increases the latency of other L1 misses when the WB buffer is full. To include WBs in PDCM, we must associate this type of memory request with trace items. Hence, during trace generation we attribute a WB to the trace item whose L1 data miss causes the cache block eviction. During trace simulation, WBs are issued to the uncore after the associated L1 data miss completes.
Branch Mispredictions
Every modern superscalar core uses branch prediction to reduce the penalty that branch instructions cause when executed in long pipelines. If a branch misprediction occurs, then the core must roll back to the mispredicted branch and flush all the instructions on the wrong path. However, during the execution of the wrong path, the core may initiate all kinds of uncore requests. The impact of wrong-path requests on performance has been a matter of study [21, 25]. Some wrong-path requests behave as prefetch requests and bring blocks to the cache that will be used in the future, thereby improving performance. However, wrong-path requests consume bandwidth through the memory hierarchy, and they may pollute the caches. Besides, wrong-path requests may also initiate additional WBs. An extra complexity is the fact that the number of wrong-path requests may change if the resolution of the branch depends on a long-latency request.
Our experimental results show that omitting wrong-path requests leads to an underestimation of the cycle count. In order to improve the accuracy, we must capture and trace wrong-path requests. Hence, a mispredicted branch generates a trace item to which all the requests on the wrong path are attributed. During trace simulation, a mispredicted-branch item will stall the fetch of new items until it completes execution and all the wrong-path requests have been issued to the uncore. A mispredicted-branch item completes when it has issued to the uncore all its attributed requests.
Prefetching
The purpose of a hardware prefetcher is to fetch cache blocks before they are needed by the program. The prefetcher uses the stream of memory accesses/misses to predict which blocks will be needed in the future. If the predicted block is not already present in the cache, and the cache and bus are not too busy, then a prefetch request is issued to the cache. Not all prefetch requests hide completely the latency of memory accesses. Hence, it may happen that demand misses become delayed hits on a prefetch. Prefetch requests account for an important percentage of the traffic through the memory hierarchy. Moreover, prefetch requests may pollute the cache, and they also initiate additional WB requests.
Considering the effect of prefetch requests on a core model such as PDCM is complex. On one side, a prefetch request depends on the stream of accesses/misses on the L1 cache. On the other side, a prefetch request also depends on the performance of the uncore. Furthermore, delayed hits pending on a prefetch request impact performance. Ignoring the impact of prefetch requests on performance may lead to misestimating the cycle count.
Experimentally, we have observed that the total number of read requests (demand misses + prefetch) to the uncore does not change significantly from one uncore configuration to another. What we have is an exchange of L1 misses for prefetch requests and vice versa. During trace generation, we record all prefetch requests, and thus we guarantee that during trace-driven simulation, PDCM issues a similar number of read requests to the uncore. However, a request that for PDCM trace-driven simulation is a prefetch may be a demand miss on a corresponding detailed simulation, or the other way around.
During trace generation, prefetch requests are attributed to the instruction that triggered the prefetch. In particular, we have configured Zesto to generate L1 prefetch requests on a miss. Therefore, a prefetch request is attributed to the same item as the L1 demand miss. During trace simulation, prefetch requests are issued to the uncore simultaneously with demand misses. However, a trace item does not need to wait for the prefetch request to return to be completed.
Delayed Hits
A delayed hit is a memory reference to a cache block for which a request has already been initiated by another instruction but has not yet completed, i.e., the requested block is still on its way from memory [2] . Delayed hits were considered by Lee et al. in the original PDCM model as an instrument to account for indirect dependences. The same problem has been addressed by Chen et al. in [2] . In the context of a limited number of overlapping long latency data cache misses due to finite MSHR resources, delayed hits have an additional impact on performance that must be addressed. In particular, Zesto models the MSHR in such a way that each L1 cache miss occupies an MSHR entry. As a result, delayed hits also occupy MSHR entries, and thus they can limit the effective number of outstanding requests that can be processed simultaneously. We have found this problem to be extremely important when modeling the performance of benchmarks such as libquantum and hmmer.
One limitation of PDCM is that the trace is generated assuming an ideal L2 cache; thus it does not capture all delayed hits that may be present when long latencies are simulated. In order to overcome this problem, during trace generation, we search for additional load or store instructions that, in the case of a long-latency access, would have been delayed hits. During trace simulation, the information about delayed hits is used in conjunction with the number of requests already in flight and the total number of MSHR entries to limit the number of outstanding requests.
PDCM Limitations
It should be noted that PDCM is a behavioral model as the time deltas are obtained from a detailed microarchitecture simulation. Because the time deltas correspond to an ideal L2 cache, PDCM is very accurate when L2 misses are few. However, PDCM uses a structural approach to model the impact of L2 misses: it is assumed that modeling the effect of the ROB and data dependences is sufficient to reproduce accurately the performance impact of L2 misses. Yet, core resources other than the ROB may impact performance significantly, for instance the limited number of ALUs, L1 cache ports, reservation stations, etc. Even considering an unlimited ROB, the time deltas between consecutive and data-independent L1 misses may depend on the miss latency, e.g., because of resource conflicts happening differently (the miss latency may impact the order in which instructions are executed and how many times instructions are rescheduled), or because a mispredicted branch is data-dependent on an L1 miss. In this context, Sect. 5 presents a new behavioral core model that is inspired by the PDCM model but tries to overcome its limitations.
BADCO: A New Behavioral Core Model
The new behavioral model we propose, BADCO, is inspired by PDCM. However, BADCO uses a behavioral method to find dependences between requests to the uncore, unlike PDCM, where an explicit data-dependence analysis is performed. Unlike PDCM, which uses a single detailed simulation to build the core model, BADCO uses two detailed simulations. Figure 4 presents the simulation flow of BADCO, which has three main phases: generation of the two traces, model building, and model simulation.
For the first detailed simulation, we force the latency of each uncore request to zero. This simulation is identical to the one done for PDCM. From this first simulation, we obtain a trace T0. Then we perform a second simulation by giving a long latency to each request. We set the request latency to a value greater than or equal to L, where L is typically greater than the greatest latency that may be experienced when using the core model, e.g., L = 1000 cycles. We give to certain requests a latency greater than L: we set the latencies so as to force the completion times of successive data requests to be separated by L cycles or more. We obtain from this second simulation a trace TL. Both T0 and TL contain some timing information for each retired µop.
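The rule used for the second simulation (every request gets a latency of at least L, and successive completion times are at least L cycles apart) can be sketched as a small helper. The function name `forced_completion` and its interface are hypothetical; how the detailed simulator actually applies these latencies is not specified here.

```python
from typing import Optional, Tuple


def forced_completion(issue_time: int,
                      last_completion: Optional[int],
                      L: int = 1000) -> Tuple[int, int]:
    """Return (latency, completion_time) for a request issued at issue_time.

    The latency is at least L, and the completion time is at least L cycles
    after the completion of the previous data request (if any).
    """
    completion = issue_time + L
    if last_completion is not None:
        completion = max(completion, last_completion + L)
    return completion - issue_time, completion


# Three requests issued at cycles 0, 10 and 5000 (illustrative values).
last = None
for t in (0, 10, 5000):
    latency, last = forced_completion(t, last)
    print(t, latency, last)
```

In this example the second request is issued while the first is still pending, so it receives a latency greater than L to keep the two completion times L cycles apart.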
A BADCO model is then built from the information contained in T0 and TL. The information in TL is used to find (direct and indirect) dependences between requests. Dependences include not only data dependences, but also branch mispredictions, limited resources (reservation stations, MSHRs, ...), etc. We do not perform any detailed analysis of these dependences during trace generation. Instead, dependences are found indirectly by analyzing the timing information in TL. We use the fact that, if a request R2 is issued before a previous request R1 is completed, R2 does not depend on R1. If R2 depends only on R1, R2 is often issued a few cycles after R1 completes. That is basically how we detect dependences. Forcing successive requests in TL to occur at intervals no less than 1000 cycles is for disambiguation: R1 is the request whose completion time is closest to the issue time of R2. Of course, this method is not 100 % reliable, but it works well in practice.
The BADCO Machine
A BADCO machine is an abstract core that fetches and executes nodes. A node $N_i$ represents a certain number $S_i$ of retired µops (not necessarily contiguous in sequential order). $S_i$ is the node size. The sum of all node sizes, $\sum_i S_i$, is equal to the total number of µops executed. As the BADCO machine works on nodes instead of µops, the bigger the nodes, the greater the expected simulation speedup. The next section explains how we build the nodes. A node $N_i$ also has a certain latency in clock cycles, called the node weight $W_i$.
Some nodes, called request nodes, carry one or several requests to the uncore. There are three sorts of request nodes: I-nodes, L-nodes and S-nodes. An I-node may carry three sorts of requests: IL1 miss, ITLB miss or instruction prefetch requests. An L-node (or S-node) carries the requests attached to one load (or store) µop (DL1 miss, DTLB miss, write-back, DL1 prefetch). An L-node or S-node can also be an I-node. In the BADCO model, a node may be dependent on one older request node, called the dependency node.
During the trace-driven simulation, the BADCO machine fetches nodes and inserts them in the BADCO window in sequential order. I-nodes send their requests to the uncore at fetch time. Node fetching imitates what the real core does. The BADCO window emulates the real core reorder buffer (ROB). When the sum of node sizes inside the window does not exceed the ROB size, the next node can be fetched. Otherwise node fetching is stalled. Once in the window, nodes can start executing. An L-node may send its requests as soon as its dependency node is completed. An L-node is considered completed when all its requests are finished. Other nodes are considered completed when their dependency node is completed. Nodes are retired from the window in the order they were fetched. A node is ready for retirement when it is completed and it is the oldest node in the window. The retirement of a node $N_i$ from the window actually happens exactly $W_i$ cycles after the node is ready for retirement. After being retired from the window, an S-node is sent to a post-retirement store queue, imitating what the real core does with stores. The requests carried by an S-node are issued to the uncore after retirement. The BADCO machine models the occupancy of the MSHRs inside the core. It imitates, to the extent possible, how the real core manages the MSHR. In particular, a request requiring an MSHR entry must wait until there is a free MSHR entry before being sent to the uncore.
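The fetch, completion and retirement rules above can be sketched with a strongly simplified machine. This sketch is an illustration under stated assumptions, not the BADCO simulator: it handles only L-nodes and non-request nodes, uses a single fixed uncore latency, and ignores I-nodes, S-nodes, fetch bandwidth and MSHR occupancy. The names `Node` and `run_machine` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    size: int                  # number of µops in the node
    weight: int                # cycles added at retirement
    dep: Optional[int] = None  # index of the dependency node, if any
    is_load: bool = False      # True for an L-node carrying a request


def run_machine(nodes: List[Node], rob_size: int, latency: int) -> int:
    """Return the cycle at which the last node retires."""
    fetch: List[int] = []
    complete: List[int] = []
    retire: List[int] = []
    window = deque()           # nodes fetched and not yet removed
    occupancy = 0              # sum of node sizes in the window
    for i, n in enumerate(nodes):
        t = fetch[-1] if fetch else 0
        # Fetch stalls while the window content exceeds the ROB size.
        while occupancy > rob_size:
            oldest = window.popleft()
            occupancy -= nodes[oldest].size
            t = max(t, retire[oldest])
        start = t if n.dep is None else max(t, complete[n.dep])
        done = start + latency if n.is_load else start
        # In-order retirement, W_i cycles after the node is ready.
        ready = max(done, retire[-1]) if retire else done
        fetch.append(t)
        complete.append(done)
        retire.append(ready + n.weight)
        window.append(i)
        occupancy += n.size
    return retire[-1] if retire else 0


nodes = [Node(1, 10, None, True), Node(2, 3, 0), Node(1, 4, None, True)]
print(run_machine(nodes, rob_size=128, latency=0))
print(run_machine(nodes, rob_size=128, latency=100))
print(run_machine(nodes, rob_size=2, latency=100))
```

With a null latency the result equals the sum of the node weights. With a small window, the second load cannot be fetched until the first node has retired, so the two requests no longer overlap.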
BADCO Model Building
The BADCO model building phase consists in grouping µops with the same dependencies to form nodes, and in defining the dependencies among nodes. Traces T0 and TL provide the information for this process.
Both traces T0 and TL in the left part of Fig. 5 represent the same sequence of dynamic µops in program order. The µops in T0 are annotated with their retirement time "RT". The µops in TL are annotated with their issue time "IT" and completion time "CT". Some µops carry one or several requests; they are called request µops. All other µops are called non-request µops. Figure 5 uses dark-gray circles for request µops and light-gray circles for non-request µops. A request µop and the non-request µops following it until the next request µop form a run.
For each µop X, we define its dependency µop D(X) as follows: D(X) is the request µop before X and closest to X whose CT is less than the IT of X. For example, µop H in Fig. 5 has IT = 1016; the closest request µop with CT < 1016 is µop A with CT = 1005, so D(H) = A.
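The definition of D(X) amounts to a backward scan over trace TL, which a short sketch makes concrete. The names `TLUop` and `dependency_uop` are hypothetical, "closest" is taken to mean closest in program order, and only the IT of H and the CT of A come from the example; the other timing values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TLUop:
    name: str
    is_request: bool
    issue_time: int        # IT in trace TL
    completion_time: int   # CT in trace TL


def dependency_uop(trace: List[TLUop], x: int) -> Optional[int]:
    """Index of D(X): the closest earlier request µop whose CT < IT of X."""
    it = trace[x].issue_time
    for j in range(x - 1, -1, -1):
        if trace[j].is_request and trace[j].completion_time < it:
            return j
    return None  # null dependency


trace = [
    TLUop("A", True, 0, 1005),       # CT of A is from the example
    TLUop("F", True, 1010, 2010),    # hypothetical values
    TLUop("H", False, 1016, 1017),   # IT of H is from the example
]
d = dependency_uop(trace, 2)
print(trace[d].name)
```

Here F is closer to H than A in program order, but F has not completed when H issues, so the scan continues back to A.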
We process traces T0 and TL simultaneously and µop by µop, in lockstep fashion. For each µop, we determine if the µop starts a new node or if it is attributed to an existing node. Every request µop X starting a run creates a new node $N_j$ to which it is attributed. The dependency node $D(N_j)$ of $N_j$ is the node to which D(X) has been attributed. All subsequent µops attributed to the same node must have the same dependency µop. In particular, all the µops in the run with the same dependency µop as X are attributed to node $N_j$. If a non-request µop cannot be attributed to any of the nodes already created for that run, we create a new node for the µop.
Attributing a µop to a node $N_i$ means incrementing the node size $S_i$ and adding to the node weight $W_i$ the difference between the retirement time of the µop in T0 and that of the previous µop. By doing so, the sum of all node weights, $\sum_i W_i$, equals the total execution time when all the requests to the uncore have a null latency.
The central part of Fig. 5 presents step by step the building process of nodes.
Step 1 processes µop A; A is a request µop and starts the node N1 with W = 10, S = 1, and D(N1) = 0.
Step 2 processes µop B; B is a non-request µop with D(B) = 0 and, as a consequence, it is attributed to N1, for which D(N1) = 0. The properties of N1 are updated: the size S is incremented, and 1 cycle is added to the weight W, this being the difference between the retirement times of B and A. Step 3 processes µop C, which starts a new node N2 with D(N2) = N1 (A is attributed to N1). The µop C cannot be attributed to the node N1 because all µops in N1 have a null dependency and C depends on A. Steps 4 and 5 attribute µops D and E to nodes N1 and N2 respectively.
Step 6 processes the request µop F and starts the processing of the second run of µops. We create a new node N3 with W = RT_F − RT_E = 3, S = 1 and D(N3) = N1 (A is attributed to N1).
Step 7 processes the non-request µop G; G starts a new node N4 because D(G) = 0 and cannot be attributed to N3. Note that G cannot be attributed to N1 either because N1 belongs to the previous run. The building process continues in a similar fashion for the subsequent µops. The right part of Fig. 5 presents the final BADCO model.
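The node-building walk-through above can be sketched in a few lines of Python. This is an illustrative reading of the rules, not the authors' code: the `Uop` and `Node` records and the `build_nodes` function are hypothetical, the dependency µop of each µop is assumed to be already known, the trace is assumed to start at time 0, and the retirement times of C, D, E and G are made up (only W = 10 for A, the 1-cycle increment for B, and RT_F − RT_E = 3 come from the text).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Uop:
    name: str
    is_request: bool
    rt: int              # retirement time "RT" from trace T0
    dep: Optional[int]   # trace index of D(X), None for a null dependency


@dataclass
class Node:
    dep_node: Optional[int]  # index of the dependency node, None if null
    size: int = 0            # S: number of µops attributed to the node
    weight: int = 0          # W: accumulated retirement-time differences


def build_nodes(trace: List[Uop]) -> List[Node]:
    nodes: List[Node] = []
    node_of: List[int] = []                    # node index of each processed µop
    run_nodes: Dict[Optional[int], int] = {}   # dependency µop -> node, current run only
    prev_rt = 0                                # assumption: the trace starts at time 0
    for uop in trace:
        if uop.is_request:
            run_nodes = {}                     # a request µop starts a new run
        if uop.dep not in run_nodes:
            # No node of this run has the same dependency µop: create one.
            dep_node = None if uop.dep is None else node_of[uop.dep]
            nodes.append(Node(dep_node))
            run_nodes[uop.dep] = len(nodes) - 1
        j = run_nodes[uop.dep]
        nodes[j].size += 1
        nodes[j].weight += uop.rt - prev_rt
        prev_rt = uop.rt
        node_of.append(j)
    return nodes


# µops A..G of the walk-through; retirement times of C, D, E and G are made up.
trace = [
    Uop("A", True, 10, None), Uop("B", False, 11, None), Uop("C", False, 12, 0),
    Uop("D", False, 13, None), Uop("E", False, 14, 0),
    Uop("F", True, 17, 0), Uop("G", False, 18, None),
]
for i, n in enumerate(build_nodes(trace), start=1):
    print(f"N{i}: W={n.weight} S={n.size} dep={n.dep_node}")
```

With these inputs the sketch creates four nodes, matching N1 to N4 in the walk-through, and the node weights add up to the last retirement time.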
Experimental Evaluation
The detailed simulator used for this experiment is Zesto [18] . Some of the characteristics of the core and uncore configurations we consider are given in Tables 1 and 2 respectively. We consider 3 different core configurations: "small", "medium" and "big". The L2, LLC and memory bus each can have a low or high value. This defines up to 8 different uncore configurations. For instance, the configuration denoted "010" has a small L2, a big LLC, and a narrow memory bus. The "big" core is the default core configuration. The default uncore configuration is "001". We will not present results for configurations "100" and "101" since they are not realistic.
For generating traces T0 and TL, we skip the first 40 billion instructions of each benchmark, and the trace represents the next 100 million instructions (no cache warming was performed). We assume that simulations are reproducible, so that T0 and TL represent exactly the same sequence of dynamic µops. We used the SimpleScalar EIO tracing feature [1] , which is included in the Zesto simulation package. We present results for the SPEC CPU2006 benchmarks that we are able to run with Zesto. We have also included two SPEC CPU2000 benchmarks, vortex and crafty. We have chosen these two benchmarks because they experience a relatively high number of instruction misses and branch mispredictions, which is interesting for testing the models. All the benchmarks were compiled with gcc-3.4 using the "−O3" optimization flag.
Metrics
The primary goal of behavioral core modeling is to allow fast simulations for studies where the focus is not on the core itself, in particular studies concerning the uncore. Ideally, a core model should strive for quantitative accuracy. That is, it should give absolute performance numbers as close as possible to the performance numbers obtained with detailed simulations. Nevertheless, perfect quantitative accuracy is difficult, if not impossible, to achieve in general with a simple model. Yet, qualitative accuracy is often sufficient for many purposes. Qualitative accuracy means that if we change a parameter in the uncore (e.g., memory latency), the model will predict accurately the relative change of performance. Indeed, if we use behavioral core modeling in a design space exploration for example, being able to estimate relative changes in performance among the different configurations in the design space is more important than being accurate in the final cycle count. Therefore we use several metrics to evaluate the PDCM and BADCO core models.
The CPI error for a benchmark is defined as
$$\text{CPI error} = \frac{CPI_{ref} - CPI_{model}}{CPI_{ref}}$$
where $CPI_{ref}$ is the CPI (cycles per instruction) for the detailed simulator Zesto, and $CPI_{model}$ is the CPI for the behavioral core model (PDCM or BADCO). The CPI error may be positive or negative. The smaller the absolute value of the CPI error, the more quantitatively accurate the behavioral core model. The average CPI error is the arithmetic mean of the absolute value of the CPI error on our benchmark set.
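As a small worked illustration of this metric, the following Python computes the signed CPI error per benchmark and the average of its absolute value. The function names, the benchmark names and the CPI values are all hypothetical; they are chosen only to show that errors of opposite sign do not cancel in the average.

```python
from typing import Dict


def cpi_error(cpi_ref: float, cpi_model: float) -> float:
    """Signed CPI error: positive when the model underestimates the CPI."""
    return (cpi_ref - cpi_model) / cpi_ref


def average_cpi_error(cpi_ref: Dict[str, float], cpi_model: Dict[str, float]) -> float:
    """Arithmetic mean over benchmarks of the absolute CPI error."""
    errors = [abs(cpi_error(cpi_ref[b], cpi_model[b])) for b in cpi_ref]
    return sum(errors) / len(errors)


# Hypothetical CPIs, not measurements from the paper.
ref = {"bench_a": 2.0, "bench_b": 1.0}
model = {"bench_a": 1.875, "bench_b": 1.0625}

print(cpi_error(ref["bench_a"], model["bench_a"]))  # prints 0.0625
print(cpi_error(ref["bench_b"], model["bench_b"]))  # prints -0.0625
print(average_cpi_error(ref, model))                # prints 0.0625
```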
For a fixed core, we define the relative performance variation RPV of an uncore xyz as
$$RPV = \frac{CPI_{001} - CPI_{xyz}}{CPI_{001}}$$
where $CPI_{001}$ is the CPI of the uncore configuration "001" and $CPI_{xyz}$ is the CPI of uncore configuration xyz (see Table 2 ). The model variation error is defined as
$$\text{Variation error} = \left| RPV_{ref} - RPV_{model} \right|$$
where $RPV_{ref}$ is the RPV as measured with the detailed core model and $RPV_{model}$ is the RPV obtained with the behavioral core model (PDCM or BADCO). The smaller the variation error, the more qualitatively accurate the behavioral core model. When the variation error is null, it means that the behavioral core model predicts for uncore xyz the exact same performance variation relative to the reference uncore as the detailed core model. The average variation error is the arithmetic mean of the variation error on our benchmark set.
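The difference between quantitative and qualitative accuracy can be illustrated with a short Python sketch. It assumes that the variation error is the absolute difference between the two RPVs, which is consistent with the error being null when both models predict the same variation; the function names and all CPI values are hypothetical.

```python
def rpv(cpi_baseline: float, cpi_xyz: float) -> float:
    """Relative performance variation of uncore xyz with respect to the baseline "001"."""
    return (cpi_baseline - cpi_xyz) / cpi_baseline


def variation_error(rpv_ref: float, rpv_model: float) -> float:
    # Assumption: absolute difference between the reference and model RPVs.
    return abs(rpv_ref - rpv_model)


# Hypothetical CPIs for one benchmark under uncores "001" and "000".
ref_001, ref_000 = 2.0, 2.5          # detailed simulator
model_001, model_000 = 1.75, 2.1875  # behavioral core model

rpv_ref = rpv(ref_001, ref_000)        # -0.25
rpv_model = rpv(model_001, model_000)  # -0.25
print(variation_error(rpv_ref, rpv_model))  # prints 0.0
print((ref_001 - model_001) / ref_001)      # prints 0.125
```

In this made-up case the model is off by 12.5 % on the CPI of the baseline uncore, yet it predicts the relative slowdown of uncore "000" exactly.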
Quantitative Accuracy
Figure 6 shows for each benchmark the CPI error of PDCM and BADCO for the "small", "medium" and "big" cores, with the uncore configuration "001". The maximum error is on libquantum, both for PDCM and BADCO and for the three core configurations. This is consistent with the non-convex curve of libquantum shown in Sect. 3, indicating an inherent modeling difficulty. Table 3 gives the average CPI error of PDCM and BADCO. BADCO is on average more accurate than PDCM for each of the three core configurations.
Qualitative Accuracy
Figure 7 shows the Relative Performance Variation (RPV) of Zesto, PDCM and BADCO for the five uncore configurations "000", "010", "011", "110" and "111" (see Table 2 ), assuming a "big" core. The baseline uncore is "001". Both PDCM and BADCO exhibit a reasonably good qualitative accuracy, i.e., they predict approximately how performance changes when we change the uncore. Neither PDCM nor BADCO is very good at predicting tiny performance changes (RPV of a few percent), but they are relatively good at predicting important performance changes. This makes PDCM and BADCO suitable for design space exploration, e.g., for selecting some "interesting" uncore configurations for which more detailed simulations will be done. Table 4 gives the average variation error of PDCM and BADCO. BADCO is on average more accurate than PDCM for each of the 5 uncore configurations.
Simulation Speed
We did all the simulation speed measurements on the same machine, which features an Intel Xeon W3550 (Nehalem microarchitecture, 8 MB L3 cache, 3.06 GHz) with Turbo Boost disabled and 6 GB of memory. All the simulation input files, including the traces for PDCM and BADCO, were stored on the local disk of that machine. Zesto, PDCM and BADCO were compiled with gcc-4.1 using the "-O3" optimization flag. We simulated the "big" core configuration and two different uncore configurations: one is the Zesto uncore configuration "001", the other is a simplistic uncore forcing all request latencies to a null value. With the simplistic uncore, what we measure is essentially the simulation time for the core alone. Figure 8 shows the simulation speed in millions of instructions simulated per second for Zesto, PDCM and BADCO.
The simulation speedup achieved with PDCM or BADCO, in comparison with Zesto, is typically between one and two orders of magnitude. Benchmarks with the greatest speedups are the ones with the fewest L1 misses. Table 5 gives the harmonic mean on our benchmarks of the simulation speed in millions of instructions simulated per second (MIPS). PDCM is generally faster than BADCO because a BADCO node represents about 50 µops on average (harmonic mean on our benchmarks), whereas a PDCM trace item represents on average 90 µops. Hence PDCM works at a larger granularity.
Fig. 7 Relative performance variation (RPV) of Zesto, PDCM and BADCO for the uncore configurations (a) "000", (b) "010", (c) "011", (d) "110" and (e) "111", assuming a "big" core. The baseline uncore is "001"
Table 4 Average variation error using as reference the configuration "001"
The PDCM and BADCO models we have implemented can be connected to a detailed uncore model. This means that the core does not know the request latency when it sends a request to the uncore. Hence the core model inspects each clock cycle in case an event occurs, which limits the simulation speedup. We believe that higher speedups might be achieved. We are currently investigating the possibility of using information from the uncore that could allow the core model to avoid inspecting every clock cycle.
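The harmonic mean used to summarize simulation speed in this section can be illustrated as follows. When every benchmark simulates the same number of instructions, the harmonic mean of the per-benchmark MIPS equals the total number of instructions divided by the total simulation time. The sketch below uses Python's standard `statistics.harmonic_mean`; the 100 million instructions per benchmark match the trace length stated earlier, while the benchmark names and simulation times are made up.

```python
from statistics import harmonic_mean


def mips(instructions: int, seconds: float) -> float:
    """Simulation speed in millions of instructions simulated per second."""
    return instructions / seconds / 1e6


INSTRUCTIONS = 100_000_000  # instructions simulated per benchmark
# Hypothetical simulation times in seconds, not measurements from the paper.
seconds = {"bench_a": 100.0, "bench_b": 25.0, "bench_c": 25.0}

speeds = [mips(INSTRUCTIONS, t) for t in seconds.values()]  # 1.0, 4.0 and 4.0 MIPS
mean_speed = harmonic_mean(speeds)
aggregate = mips(INSTRUCTIONS * len(seconds), sum(seconds.values()))

print(mean_speed, aggregate)  # both are 2.0 MIPS
```

An arithmetic mean of the same speeds would give 3.0 MIPS and overstate the throughput actually observed over the whole benchmark set.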
Modeling Multicore Architectures with BADCO
In recent years, research in microarchitecture has shifted from single-core to multicore processors. Cycle-accurate models for many-core processors featuring hundreds or even thousands of cores are out of reach for the simulation of realistic workloads. Approximate simulation methodologies that trade accuracy for simulation speed are necessary for conducting certain research, in particular for studying the impact of resource sharing between cores, where the shared resource can be caches, on-chip network, memory bus, power, temperature, etc.
Behavioral core models are one option to trade accuracy for simulation speed in situations where the focus of the study is not the core itself but what is outside the core, i.e., the uncore. In Sects. 4 and 5, we presented two behavioral core models: PDCM, a previously proposed core model that we have extended to model realistic superscalar processors; and BADCO, a new behavioral core model that is more accurate than PDCM. Both core models enable fast simulation of multicore architectures when the design target is the uncore. In this section we evaluate the speed and accuracy of BADCO when simulating multiprogram workloads for processor configurations with 2, 4 and 8 cores.
Extending BADCO to execute multiprogram workloads is straightforward. Once BADCO core models have been built for a set of single-thread benchmarks, the core models can be easily combined to simulate a multi-core running several independent threads simultaneously. We connect several BADCO machines, one per core, to a detailed simulator of the uncore. 11 A BADCO machine communicates with the uncore by sending requests and receiving the acknowledgment of the completion of those requests. BADCO machines send read and write requests to the uncore. A request indicates the type of transaction and the virtual memory address. The uncore simulator informs the BADCO machine when requests have completed.
There is a round robin arbitration to decide which BADCO machine can access the uncore. When the uncore receives a request, it translates the virtual address to a physical address. If a page miss occurs, BADCO allocates a new physical page. Once this is done, the uncore processes the request. The uncore notifies the BADCO machine about the completion of requests through a call-back. A request completes once it is fully processed by the memory hierarchy.
Analogously to Zesto, BADCO does not model physical page conflicts. In both Zesto and BADCO, the main memory is assumed infinite. That means that every time a page miss occurs, BADCO allocates a new physical page. The assignment of physical pages to virtual pages is made in a sequential fashion. During trace generation, BADCO traces save the requests' virtual addresses to ease multicore simulation.
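The sequential page assignment described above can be sketched as a tiny allocator. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 4 KiB page size is an assumption (the text does not give one), and keying the page table by core identifier assumes that each independent thread has its own virtual address space.

```python
from typing import Dict, Tuple

PAGE_SIZE = 4096  # assumption: 4 KiB pages; the text does not state the page size


class SequentialPageAllocator:
    """Maps virtual pages to physical pages, assuming an infinite main memory."""

    def __init__(self) -> None:
        # (core id, virtual page number) -> physical page number
        self.page_table: Dict[Tuple[int, int], int] = {}
        self.next_free_page = 0

    def translate(self, core_id: int, vaddr: int) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        key = (core_id, vpage)
        if key not in self.page_table:
            # Page miss: allocate the next physical page; pages are never evicted.
            self.page_table[key] = self.next_free_page
            self.next_free_page += 1
        return self.page_table[key] * PAGE_SIZE + offset


alloc = SequentialPageAllocator()
print(alloc.translate(0, 0x1000))  # prints 0 (first page miss, physical page 0)
print(alloc.translate(1, 0x1000))  # prints 4096 (other core, physical page 1)
print(alloc.translate(0, 0x1004))  # prints 4 (page already mapped for core 0)
```

Because physical pages are handed out in request order, the mapping depends on how the requests of the different cores interleave, which is why the traces keep virtual addresses rather than physical ones.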
Experimental Setup
Our experiments analyze the performance of multicore processors with 2, 4 and 8 identical cores. Table 6 presents a summary of the core characteristics. A case study with five uncore design points is evaluated, each design point corresponding to a different replacement policy in the shared last-level cache: LRU, RANDOM (RND), FIFO, DIP and DRRIP. Table 7 gives the uncore characteristics. We build a sample of 250 random multiprogram workloads from 22 of the 29 SPEC CPU2006 benchmarks (the 22 benchmarks that we were able to simulate with Zesto). We perform detailed simulation with Zesto and trace-simulation with BADCO for every design point and every workload in the sample. Then, we compare BADCO's accuracy in terms of CPI error and speedup error using Zesto as reference. Finally, we present the average simulation speed of both BADCO and Zesto.
11 The uncore simulator was extracted from Zesto.
All the benchmarks were compiled with gcc-3.4 using the "−O3" optimization flag. For generating BADCO traces, we skip the first 40 billion instructions of each benchmark, and the trace represents the next 100 million instructions (no cache warming is performed). We assume that simulations are reproducible, so that traces represent exactly the same sequence of dynamic µops. We used the SimpleScalar EIO tracing feature [1] , which is included in the Zesto simulation package.
During multiprogram execution, each core runs a separate thread. When a thread has finished executing its 100 million instructions earlier than the other threads, it is restarted. This is done as many times as necessary until all the threads in the workload have executed at least 100 million instructions. Performance is measured only for the first 100 million committed instructions of each thread.
Experimental Results
Figure 9 reports the measured and the estimated CPIs for Zesto and BADCO respectively. Each dot in the graph represents the CPI performance of individual benchmarks in the 250 workloads and the five design points. A perfect estimation would imply that all the dots lie on the bisector. In this case, we observe that most of the points are over the bisector. This indicates that BADCO tends to slightly underestimate the CPI. Table 8 presents the average of the absolute CPI error for 2, 4 and 8 cores and each design point. The global average of the absolute CPI error is 4.59, 3.98 and 4.09 % for 2, 4 and 8 cores respectively. The maximum error is in all cases less than 25 %. Moreover, for approximate simulators, predicting speedups accurately is more important than predicting CPIs accurately. We compared the speedups predicted by BADCO and Zesto for replacement policies FIFO, RANDOM, DIP and DRRIP using LRU as reference. We found that, on average, the global speedup error is 0.66, 0.61 and 1.43 % for 2, 4 and 8 cores respectively. Table 9 presents the individual speedup errors for four design pairs that use LRU as reference. Results show that BADCO is notably better at predicting speedups than raw CPIs.
Multicore Simulation Speed
Table 10 reports the simulator performance of Zesto and BADCO. BADCO is clearly faster than the detailed simulator Zesto, with simulation speedups going from 15x to 68x when going from 1 to 8 cores. Zesto's speed decreases faster than BADCO's speed mainly because of memory management. Zesto must manage a memory space for each application. Such a space typically reaches 1 GB or more. When simulating many cores, Zesto has no option but paging in order to keep running. BADCO does not have the same problem, and the decrease in simulation performance is mainly due to the increase of conflicts and work in the uncore simulation. These simulation times do not include the time spent generating BADCO models. Nevertheless, a benchmark can be integrated in many workloads and many different simulations of the uncore, and thus the one-time cost of building a BADCO model is rapidly compensated by BADCO's speedup. It should be noted that BADCO still uses a detailed uncore simulator whose simulation speed may not be optimal, thus limiting the potential speedup that BADCO can provide.
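How quickly the model-building cost is amortized can be estimated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name is hypothetical, and it assumes that building a model costs about as much as two detailed simulations (a BADCO model is built from two detailed simulations) and that every later simulation runs `speedup` times faster than the detailed one. Trace generation overheads are ignored.

```python
import math


def break_even_runs(build_cost_in_detailed_runs: float, speedup: float) -> int:
    """Smallest number of simulations after which building the model pays off.

    Costs are expressed in units of one detailed simulation: N detailed runs
    cost N, whereas the model costs build_cost + N / speedup.
    """
    if speedup <= 1.0:
        raise ValueError("the model must be faster than the detailed simulator")
    saved_per_run = 1.0 - 1.0 / speedup
    return math.ceil(build_cost_in_detailed_runs / saved_per_run)


# Speedups of 15x and 68x are the 1-core and 8-core figures quoted above;
# the build cost of two detailed runs is an assumption.
print(break_even_runs(2.0, 15.0))  # prints 3
print(break_even_runs(2.0, 68.0))  # prints 3
```

Under these assumptions the building cost is recovered once a benchmark has been reused in about three simulations, which is small compared with the 250 workloads and five design points of the case study.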
Conclusion
We introduced BADCO, a new behavioral application-dependent model of superscalar cores. A behavioral core model is like a black box emitting requests to the uncore at certain times. It can be connected to a detailed uncore model for studies where the focus is not the core itself, e.g., design space exploration of the uncore or the study of multiprogram workloads. We have also extended PDCM, a previously proposed core model, in order to model realistic superscalar processors more accurately. A BADCO model is built from two detailed simulations. Once the time to build the model is amortized, important simulation speedups can be obtained. We have compared the accuracy of BADCO with that of PDCM. From our experiments, we conclude that BADCO is on average more accurate than PDCM, essentially because it is based on two detailed simulations instead of a single one for PDCM. With BADCO, the average of the absolute CPI error is less than 4 % for all configurations and benchmarks we have tested. We have also evaluated the accuracy of BADCO for simulating multiprogram workloads: the average of the absolute CPI error is less than 5 % for 2, 4 and 8 cores and all evaluated configurations. Moreover, we have demonstrated that BADCO offers a good qualitative accuracy, being able to predict how performance varies when we change the uncore configuration in both single-core and multicore execution. So far, the simulation speedups we have obtained with BADCO are typically between one and two orders of magnitude compared with Zesto.
