Modeling and simulating pipelined processors in procedural languages such as C/C++ incurs a high cost in handling concurrent events, which hinders fast simulation. A number of studies on simulation have devised speed-up techniques that reduce the number of events. This paper presents a new simulation approach developed to speed up the simulation of pipelined processors. The proposed approach is based on early pipeline evaluation, in which all the intermediate values of an instruction are computed in advance, creating a future state for the following instructions. The future state allows the following instructions to be computed without considering data dependencies between nearby instructions. We apply this concept to building a cycle-accurate simulator for a pipelined RISC processor and achieve almost the same speed as an instruction-level simulator.
INTRODUCTION
Modern embedded systems usually include at least one programmable instruction set processor, as the software approach helps the designer keep up with shrinking time-to-market with less effort. The portion to be implemented in software or hardware is determined in the early design stage, and simulation is a prevalent method, among several approaches, for making this decision.
In the performance evaluation of an embedded system, the timing accuracy of the instruction set simulator becomes more and more important. As the programmable core embedded in the system interacts with memory, memory-mapped I/O and other system components, its timing accuracy has a significant impact on the confidence of the performance analysis [2]. While an instruction-level processor simulator connects the hardware that cooperates with the processor and the software running on the processor only at the functional level, a cycle-accurate processor simulator lets the software interact with the hardware at the exact time. Additionally, a cycle-accurate instruction set simulator can be used to validate the implementation of the processor by comparing the states of the two.
In the simulation of pipelined processors, there are several levels of abstraction, from gate level to functional level. Gate-level models described in an HDL are very accurate in dealing with detailed timing, but their simulation speed is slow, and it is difficult to try many design alternatives in the design space exploration stage. In contrast, functional-level models written in procedural programming languages like C/C++ are easier to modify and run faster. Unfortunately, building a cycle-accurate instruction set simulator requires a detailed simulation model to resolve the pipeline hazards [1] and often requires rewriting the model in another language. Worse, handling pipeline hazards requires analyzing the data dependencies between the instructions within the pipeline, which prevents fast simulation. Therefore many simulation techniques have been proposed to improve the simulation speed while maintaining the timing accuracy [3].
One approach is to provide an optimized simulation library that makes it easy to model concurrent behaviors. SystemC [16] is an example of this approach. It exploits an object-oriented programming technique to provide methods to describe concurrent behaviors of hardware components and corresponding simulation engines. A built-in message-passing mechanism is used in Incas [4] for a cycle-accurate model of UltraSPARC. This approach, however, does not improve the simulation speed in spite of its efficient modeling capabilities.
A notable technique is proposed in [5], which introduces the delay operator to resolve pipeline hazards automatically and implicitly. The delay operator is used when the source operand of a fetched instruction is the destination register of a previous instruction that is still alive in the pipeline. In this case a delay operator is assigned to point to the value that should be taken from the preceding context. However, since the delay operators can be locally connected and spread out over the pipeline stages, handling them places another burden on fast simulation.
Compiled simulation is regarded as faster than interpretive simulation because it benefits from a priori information, moving frequent operations such as instruction decoding from simulation run-time to compile-time [6] [7]. Furthermore, compiled simulation exploits host machine instructions to speed up the simulation. This concept was introduced by HSS [8], which converts the cycle-free portions of a circuit into machine code optimized for the host machine that executes the simulation. Similarly, LISA [9] translates executables from one machine to another, like binary translation [10]. Low-level code generation is presented in [11] to aggressively utilize host machine resources. However, even in compiled simulation, run-time scheduling is still inevitable [12] since not all the operations can be scheduled at compile time. Moreover, since a compiled simulator attempts to resolve pipeline hazards statically at compile time, the simulator must consider every possible case that can be encountered in dynamic scheduling.
These previous works on implementing a cycle-accurate instruction set simulator are commonly devoted to mimicking the pipelined processor, without considering the software characteristics of the simulator. Consequently, they simply translate the concurrent behavior of the pipelined processor into a high-level programming language, which generates a considerable number of scheduling events. This paper proposes a new simulation technique to reduce these scheduling events. The main point is to provide correct values of the processor state at the right time, not to calculate the values in the same way as the hardware does.
The rest of this paper is organized as follows. Section 2 presents the concept of the proposed simulation technique based on early pipeline evaluation and shows, by way of an example, how to apply the technique to an instruction set simulator. The overall structure of the simulator is then described in Section 3. The implementation and the performance of the proposed simulator are presented in Section 4.
EARLY PIPELINE EVALUATION
Pipelining is a key implementation technique for increasing processor performance by overlapping the execution of multiple instructions. Because pipelining splits the execution of an instruction into smaller pieces, the instructions alive in the pipeline can have a number of interactions caused by pipeline hazards. In software modeling of a pipelined processor, however, a clock period is not a physical value but a virtual concept. This implies that the execution of an instruction does not need to be split to fit into a short clock period as in hardware. Early pipeline evaluation removes the need to consider pipeline hazards by effectively executing the program at instruction level, which improves the simulation speed. This is possible because the results of the previous instructions are already available and all the subsequent pipeline operations are determined. The proposed simulation technique is a departure from the conventional cycle-accurate simulator, which attempts to mimic the hardware operations.
Future State
The values resulting from early pipeline evaluation create a future state, which is a reference state used in the computation of the following instructions. The future state is obtained by executing the program at instruction level, while the processor state resulting from executing a cycle-accurate pipelined model is called the architectural state. The architectural state is updated in the same way as in the original pipelined processor, providing an architectural view to programmers. This separation of future and architectural states is also applied to implementing precise interrupt handling in out-of-order processors [13].
As the processor state consists of a register file and a data memory, the future state consists of a future register file and a future memory, and the architectural state consists of an architectural register file and an architectural memory. Unlike the register file, the future memory state is kept in a memory buffer that has as many entries as the number of pipeline stages. As the memory is considerably larger than the register file and the difference between the future memory and the architectural memory is small, duplicating the memory to store the latest values would be wasteful. Therefore a small memory buffer is used to keep the values of 'store' instructions alive in the pipeline. Since these values are more recent than those in the data memory, the buffer is searched before the data memory is accessed directly, and a value is bypassed to a 'load' operation if the referenced address is found in the buffer.
An associative search is used to carry out the bypass between nearby memory instructions referencing the same address. For an efficient associative search, a hash table is used; in the memory buffer, a number of LSBs of the memory address can be used as the key. When a 'store' instruction is fetched, the early pipeline evaluation inserts its value into the memory buffer, i.e. the future state. When the instruction reaches the memory stage, this value is deleted from the buffer and written to the data memory, i.e. the architectural state. Between these operations a 'load' instruction may access the same address as that of the 'store' instruction. In that case the early pipeline evaluation searches the memory buffer using the LSBs of the address and the value is fetched from the buffer. If a 'load' instruction accesses a different address, the memory buffer returns nothing and the value is fetched directly from the data memory.
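The store-to-load bypass through the memory buffer can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulator's actual code: the class name `FutureMemoryBuffer`, its method names, the 4-bit key width, and the dictionary used as data memory are all hypothetical choices made for the example.

```python
class FutureMemoryBuffer:
    """Small hash-indexed buffer for values of 'store' instructions in flight."""

    def __init__(self, data_memory, key_bits=4):
        self.data_memory = data_memory      # architectural memory: addr -> value
        self.mask = (1 << key_bits) - 1     # LSBs of the address form the hash key
        self.buckets = {}                   # key -> list of (addr, value), oldest first

    def store_fetched(self, addr, value):
        """A 'store' was fetched: record its value in the future state."""
        self.buckets.setdefault(addr & self.mask, []).append((addr, value))

    def store_reaches_memory_stage(self, addr):
        """The oldest store to addr reached the memory stage: commit and delete it."""
        bucket = self.buckets.get(addr & self.mask, [])
        for index, (entry_addr, value) in enumerate(bucket):
            if entry_addr == addr:
                del bucket[index]
                self.data_memory[addr] = value   # update the architectural state
                return
        raise KeyError(f"no pending store for address {addr:#x}")

    def load(self, addr):
        """A 'load' looks in the buffer first, then falls back to data memory."""
        bucket = self.buckets.get(addr & self.mask, [])
        for entry_addr, value in reversed(bucket):   # newest matching store wins
            if entry_addr == addr:
                return value
        return self.data_memory.get(addr, 0)     # unwritten memory reads as 0 (assumed)
```

Comparing the full address inside a bucket matters because two different addresses can share the same LSBs; only an exact match is bypassed to the load.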
Early pipeline evaluation calculates all the intermediate values by exploiting the future state. This brings considerable speed-up over conventional cycle-accurate simulation. If the operation of an instruction is split into the corresponding pipeline stages, the instruction must be identified at every pipeline stage in order to process its intermediate values properly. Though this identification is not as complex as in the decode stage, the decoding time at each pipeline stage takes a large portion of the simulation time. Moreover, not only the current instruction but also the previous instructions must be considered to resolve the data dependencies. The proposed early pipeline evaluation removes this work and saves simulation time significantly.
Simulation Example
Proceeding ahead, the next instruction sw (store word) is fetched at cycle t+1. Its source operand is the destination register of the previous instruction, which requires data forwarding in a cycle-based simulator. In the proposed simulation, however, since all the results of the add instruction have already been computed, the value of the source operand is readily available in the future register file, and the results of the sw instruction can be computed from the future register file without a forwarding mechanism. As the pipeline proceeds by a cycle, the update pointer of the add instruction is moved to the next stage, i.e. the decode stage, and the intermediate values of stage D are updated to the architectural state. At the same time, the pointer of the sw instruction points to the fetch stage as depicted in Fig. 2(b).
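The add followed by sw case can be sketched in Python as below. The sketch assumes a five-stage pipeline in which write-back happens four cycles after fetch; the class `EarlyEvalRegisters`, its methods, and the register numbers used in the usage note are hypothetical and only illustrate how a future register file removes the need for forwarding.

```python
WRITE_BACK_LATENCY = 4  # assumed: 5-stage pipeline, write-back 4 cycles after fetch


class EarlyEvalRegisters:
    """Future and architectural register files with delayed architectural update."""

    def __init__(self, num_regs=32):
        self.future = [0] * num_regs   # updated as soon as an instruction is fetched
        self.arch = [0] * num_regs     # updated when the instruction reaches write-back
        self.pending = []              # (commit_cycle, register, value)
        self.cycle = 0

    def set_both(self, reg, value):
        """Initialize a register whose value is already committed."""
        self.future[reg] = value
        self.arch[reg] = value

    def fetch_add(self, rd, rs, rt):
        """Evaluate 'add rd, rs, rt' completely at fetch time."""
        value = (self.future[rs] + self.future[rt]) & 0xFFFFFFFF
        self.future[rd] = value
        self.pending.append((self.cycle + WRITE_BACK_LATENCY, rd, value))
        return value

    def read_source(self, reg):
        """Source operands always come from the future register file."""
        return self.future[reg]

    def tick(self):
        """Advance one cycle and commit results whose write-back cycle has come."""
        self.cycle += 1
        still_pending = []
        for commit_cycle, reg, value in self.pending:
            if commit_cycle <= self.cycle:
                self.arch[reg] = value
            else:
                still_pending.append((commit_cycle, reg, value))
        self.pending = still_pending
```

For instance, after `fetch_add(5, 1, 2)` at cycle t and one `tick()`, `read_source(5)` already returns the sum for a store fetched at t+1, while `arch[5]` keeps its old value until the fourth `tick()`.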
The third instruction beq (branch if equal) needs to evaluate the branch condition, which is true if its two operands $5 and $6 have identical values. Although $5 has a data dependency and $6 does not, both are fetched from the future register file. In this example the two registers have the same value, 0xAD, thus the branch is taken. Therefore, the calculated pc points to the branch target address, labeled label1, whose actual address is 0x00400300. As MIPS has one branch delay slot, lui (load upper immediate), which is located next to beq, is fetched at cycle t+3. The source operand of this instruction is supplied by its code word, so the results are computed without consulting the future register file. At cycle t+4, the first instruction reaches its write-back stage, so $5 in the architectural register file is updated to the value 0xAD as in Fig. 2(c). Likewise, since the second instruction reaches the memory stage, the memory is updated and the corresponding element in the memory buffer is removed. Here we assume that a memory operation, either 'write' or 'read', is completed in one cycle, so that the pipeline proceeds without a stall.
SIMULATOR ORGANIZATION
Fig. 3 illustrates the organization of the simulator. As explained in the previous section, the simulator has the architectural state, i.e. the architectural register file and the architectural memory, which represents the real state of the processor, and the future state, i.e. the future register file and the future memory buffer, which is a reference state for the computation of the following instructions. As the pipeline advances, a new instruction is fetched and its intermediate values are placed in the next circular buffer slot as numbered in Fig. 3, and the values from the previous instructions are updated to the architectural register file and memory after the pipeline latencies.
Since the maximum number of instructions alive in the pipeline is equal to the pipeline depth, the number of slots in the circular buffers is the same as the pipeline depth (n). Therefore the intermediate values computed for the (i+n)-th successively fetched instruction can be stored into slot S_i, as all the intermediate values of instruction I_i have already been updated into the architectural state by that time.
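The reuse of circular buffer slots, where the (i+n)-th fetched instruction overwrites the slot of the i-th, amounts to indexing slots modulo the pipeline depth. The following Python sketch shows that indexing only; the class `IntermediateSlots` and its `allocate` method are hypothetical names, and stalls are not modeled here.

```python
class IntermediateSlots:
    """Circular buffer with one slot per instruction alive in the pipeline."""

    def __init__(self, depth):
        self.depth = depth               # pipeline depth n = number of slots
        self.slots = [None] * depth      # each slot holds one instruction's values
        self.fetched = 0                 # count of instructions fetched so far

    def allocate(self, intermediate_values):
        """Store the values of the next fetched instruction; return its slot index."""
        index = self.fetched % self.depth    # instruction i and i+n share slot S_i
        self.slots[index] = intermediate_values
        self.fetched += 1
        return index
```

With a depth of 5, the sixth fetched instruction lands in slot 0 again, which is safe only because the first instruction has already left the pipeline.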

Fig. 3. Overall organization of the simulator (figure labels: Early Pipeline Evaluation; Pipeline Depth = n).
While the future state is updated instantly, the architectural state is updated according to the pipeline latency. This latency, however, may vary depending on the previous instructions. For example, the pipeline may be stalled by slow memory operations, multi-cycle instructions and so on. To achieve an accurate performance analysis, these pipeline stalls should be considered.
In a real-time environment, the memory usually operates more slowly than the processor core, so a 'load' or 'store' operation requires more than one cycle. Even when a cache is placed between the processor and the memory, a cache miss requires a multi-cycle memory operation. In addition, practical processors often contain multi-cycle instructions that stall the following instructions. To handle these situations, an active stage pointer is used to indicate the active stage set, i.e. the set of instruction entries that should be written to the architectural state in a given cycle. Fig. 4 shows how the active stage pointer indicates the update sequence of the architectural state. Since only one entry of each slot in the circular buffers is active in every cycle, the active stage set consists of the entries of the slots located on the diagonal line starting from the active stage pointer, which normally points to the write-back stage of the instruction fetched (n-1) cycles earlier. The update of the architectural state proceeds along this line, starting from the active stage pointer. If a stall occurs at cycle t due to a slow memory operation as in Fig. 4(a), the update stops at the stalled stage, i.e. the memory stage, and the active stage pointer remains there. The update starts again from the stalled stage after the stall condition is removed. Therefore only the write-back stage is updated at C_t and the following stages are updated at C_{t+1} in Fig. 4(b). To reflect accurate memory access cycles, stall information can be attached to the memory operation instead of connecting a memory or cache simulator, which may cause severe speed degradation.
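The update along the diagonal and its interruption by a stall can be sketched in Python as follows. This is a simplified model under assumptions: a five-stage pipeline with stages named F, D, E, M, W, instructions held in a plain list ordered oldest first, and a hypothetical function `update_cycle` whose `stalled_stage` argument names the stage that cannot complete in the current cycle. The first list element plays the role of the active stage pointer.

```python
STAGES = ["F", "D", "E", "M", "W"]  # assumed five-stage pipeline


def update_cycle(in_flight, stalled_stage=None):
    """Write this cycle's active entries to the architectural state.

    in_flight is a list of dicts {"name": str, "stage": int}, oldest first,
    where "stage" indexes the entry to be written next. Returns the list of
    (name, stage) entries written in this cycle.
    """
    written = []
    for instr in in_flight:                  # walk the diagonal from the oldest
        stage = STAGES[instr["stage"]]
        if stage == stalled_stage:
            break                            # update stops at the stalled stage
        written.append((instr["name"], stage))
        instr["stage"] += 1                  # this instruction advances one stage
    # Instructions that completed write-back leave the pipeline.
    in_flight[:] = [i for i in in_flight if i["stage"] < len(STAGES)]
    return written
```

Calling `update_cycle` with `stalled_stage="M"` on a full pipeline writes only the write-back entry of the oldest instruction; the next call without a stall resumes from the memory stage and writes the remaining entries in order.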
Debugging Mode
A debugging mode is added to the simulator so that the user can observe the pipeline registers by stepping through the program cycle by cycle. Since the user might be doubtful about the value of a certain pipeline register, a simulator must provide a way to trace back the origin of the value. This is trivial if the value comes solely from the preceding pipeline stage. However, if it is forwarded from another part of the pipeline, tracing back becomes rather complicated because the proposed simulation technique inherently has no forwarding mechanism.
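One way to recover where such a forwarded value would have come from is a conventional dependency check over the instructions still in the pipeline. The Python sketch below is a hypothetical illustration of that idea, not the simulator's own routine: the function name `trace_origin` and the tuple layout (name, destination register, current stage) are assumptions made for the example.

```python
def trace_origin(preceding, src_reg):
    """Find which in-flight instruction produced the value of src_reg.

    preceding lists the instructions still in the pipeline, oldest first,
    as (name, dest_reg, stage) tuples; dest_reg is None when the
    instruction writes no register. Returns (name, stage) of the youngest
    producer, i.e. the forwarding source, or None when the value comes
    straight from the register file.
    """
    for name, dest_reg, stage in reversed(preceding):   # youngest producer first
        if dest_reg is not None and dest_reg == src_reg:
            return name, stage
    return None
```

Scanning from the youngest instruction matters when several in-flight instructions write the same register, since hardware forwarding would select the most recent one.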
In the debugging mode the simulator keeps the preceding instructions that still reside in the pipeline, as many as the pipeline depth. To trace the origin of a value, it looks up the preceding instructions to find the data dependency using a conventional dependency-checking algorithm and indicates the proper forwarding path based on the result of the evaluation. Although this is quite a cumbersome and time-consuming task, it is not a serious problem because the debugging mode does not require fast simulation speed.
EXPERIMENTAL RESULTS
The size of the machine description for the MIPS processor is 1078 lines. The generated simulators support an interactive mode to monitor the processor status step by step. Their functionality is fully verified against a reference simulator, SPIM [15]. DSPStone benchmark programs compiled with the GNU cross-compiler are used as simulation inputs. The simulation is executed on a Sun Blade 2000 workstation and the result is presented in Fig. 5.
The simulation results show that the simulation speed is 2.2 million cycles/sec on average, almost as fast as the instruction-level simulator. It is also quite fast compared to other retargetable interpretive instruction set simulators. The authors of [5] report that the average simulation speed of their simulator is 240 kilocycles/sec, and that of UPFAST [14] is 200 kilocycles/sec. The simulation speed achieved by the proposed method is comparable even to that of retargetable compiled simulators. For example, the LISA framework achieves 2.5 million instructions/sec for 'fir' and 0.8 million instructions/sec for 'adpcm' [9]. Note that the proposed method is incorporated into interpretive simulation to provide flexibility and debugging convenience. However, the proposed approach is not limited to interpretive simulation and can easily be incorporated into compiled simulation.
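The ratios implied by the reported cycle rates can be worked out with a few lines of Python. Only the three cycle rates come from the text; the variable and function names are illustrative, and the LISA figures are left out because they are given in instructions/sec rather than cycles/sec.

```python
PROPOSED_CYCLES_PER_SEC = 2_200_000          # 2.2 million cycles/sec (this work)
REPORTED_CYCLES_PER_SEC = {
    "interpretive simulator at 240 kilocycles/sec": 240_000,
    "UPFAST [14]": 200_000,
}


def speed_ratios(proposed, reported):
    """Return how many times faster the proposed rate is than each reported rate."""
    return {name: proposed / rate for name, rate in reported.items()}


ratios = speed_ratios(PROPOSED_CYCLES_PER_SEC, REPORTED_CYCLES_PER_SEC)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.1f}x")
```

Under these figures the proposed simulator runs roughly 9 times and 11 times as many cycles per second as the two interpretive simulators, keeping in mind that the measurements were taken on different machines and benchmarks.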
CONCLUSION
This paper has presented a new simulation technique based on early pipeline evaluation and demonstrated the potential of the proposed approach by implementing an instruction set simulator. As opposed to real pipelined processors, the proposed approach computes all the pipelined operations of an instruction at once, including all the intermediate values. The early pipeline evaluation creates a future state that maintains the latest values of the processor state, which makes it possible to compute the subsequent values without considering the data dependencies between nearby instructions. This technique can be applied not only to instruction set simulators but also to pipelined IP blocks, even if they have different levels of abstraction.
