Abstract: At the Electronic System Level (ESL), design validation often relies on discrete event (DE) simulation. Recently, parallel simulators have been proposed which increase simulation speed by using multiple cores available on today's PCs. However, the total order of time in DE simulation is a bottleneck that severely limits the benefits of parallel simulation. This paper presents a new out-of-order simulator for multi-core parallel DE simulation of hardware/software designs at any abstraction level. By localizing the simulation time and carefully handling events at different times, a system model can be simulated following a partial order of time. Subject to automatic static data analysis at compile time and table-based decisions at run time, threads can be issued early which reduces the idle time of available cores. Our experiments show high performance gains in simulation speed with only a small increase of compile time.
I. INTRODUCTION
ESL design models specified in System-level Description Languages (SLDLs), such as SystemC [8] and SpecC [7], are usually validated using simulation. The simulator is a regular discrete event (DE) simulator. Within a single process, multiple concurrent threads emulate the parallelism in the design model. Typically, the multi-threading model is cooperative (i.e., non-preemptive), which simplifies the communication through events and variables in shared memory. Recent works [10], [11], [2] aim to utilize the parallel computation resources available in multi-core CPUs that are common in today's host PCs. Here, an extended simulation kernel uses OS kernel threads and additional synchronization for running multiple threads in parallel on the available cores. However, the number of threads that can run in parallel at each scheduling step is often very limited. The inner loops for delta-cycle and simulation time update in DE simulation severely limit the usable parallelism.
In this work, we relax the global in-order event and timing update based on compile-time automatic static analysis of the threads and their potential conflicts. Using the analysis results, our extended simulation kernel can then at run-time quickly decide whether or not any conflicts between candidate threads exist. If not, it issues threads early (with local timestamps). In turn, parallelism is maximized and simulation speed increases.
In other words, we extend parallel ESL simulation by aggressive out-of-order execution for higher simulation speed while maintaining all SLDL semantics and accurate timing.
After a brief discussion of related work, Section II motivates our idea using a simple DVD player example. Section III presents our out-of-order parallel DE simulation in detail and Section IV shows its higher simulation speed in experiments.
A. Related Work
Parallel Discrete Event Simulation (PDES) is a well-studied subject [1], [6], [9]. Two major synchronization paradigms exist, namely conservative and optimistic [6]. Conservative PDES ensures in-order event execution. In contrast, the optimistic paradigm assumes that every event is safe when executed and rolls back when this proves incorrect. Often, the temporal barriers in the model prevent effective parallelism in conservative PDES, while rollbacks in optimistic PDES are expensive in implementation and execution.
C-based SLDLs use DE simulation driven by events and simulation time advances. To interpret "zero-delay" semantics of SLDLs, the notion of delta-cycles imposes a partial order on the events that happen at the same time [8]. PDES with the delta-cycle notion has also been explored. For example, [10], [11], [3], [2] extend the SystemC and SpecC kernels, respectively, to allow parallel simulation on multi-core processors. [10], [11], [2] apply PDES to SystemC and SpecC targeting symmetric multi-processing (SMP) architectures by using conservative synchronization. However, as an obstacle, the global simulation time is shared by all threads.
II. MOTIVATION
While the reference simulators for both SystemC and SpecC are single-threaded, parallel approaches like [10], [2] take advantage of the fact that several threads running at the same simulation time and delta-cycle can be issued in parallel. However, even these PDES approaches impose a total order on event delivery and time advance, which makes delta- and time-cycles absolute barriers for thread execution. More specifically, when a thread finishes its execution for a cycle, it has to wait until all other active threads complete their execution for the same cycle. Only then does the simulator advance to the next delta or time cycle. Additionally, available CPU cores are idle until all threads have reached the cycle barrier.
As a motivating example, Fig. 1 shows a high-level model of a DVD player which decodes a stream with H.264 video and MP3 audio data using separate decoders. Since video and audio frames are data independent, the decoders run in parallel. Both output the decoded frames according to their rate, 30 FPS for video (delay 33.3ms) and 38.28 FPS for audio (delay 26.12ms).
Unfortunately, regular PDES approaches cannot exploit the parallelism in this example. Fig. 2(a) shows the thread scheduling along the time line. Except for the very first scheduling step, only one thread can run at any time. Note that it is not data dependency but only the global timing that prevents parallel execution in the simulator.
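The effect can be checked with a small calculation. The following Python sketch is an illustration only, assuming that each decoder becomes ready exactly at integer multiples of its frame delay (33.3 ms and 26.12 ms); the function names are hypothetical and not part of any simulator. It groups ready threads by global time, as an in-order scheduler would, and lists the steps at which both decoders are ready together.

```python
# Frame delays from the text, in microseconds to keep the arithmetic exact.
VIDEO_PERIOD_US = 33_300   # 30 FPS video, delay 33.3 ms
AUDIO_PERIOD_US = 26_120   # 38.28 FPS audio, delay 26.12 ms


def wakeup_times(period_us, horizon_us):
    """Simulation times at which a decoder becomes ready (assumed model)."""
    return set(range(0, horizon_us + 1, period_us))


def in_order_steps(horizon_us):
    """Group ready threads by global time, as an in-order scheduler would."""
    video = wakeup_times(VIDEO_PERIOD_US, horizon_us)
    audio = wakeup_times(AUDIO_PERIOD_US, horizon_us)
    steps = []
    for t in sorted(video | audio):
        ready = [name for name, times in (("video", video), ("audio", audio))
                 if t in times]
        steps.append((t, ready))
    return steps


steps = in_order_steps(1_000_000)          # first second of simulated time
parallel_steps = [t for t, ready in steps if len(ready) == 2]
print(len(steps), parallel_steps)          # prints: 69 [0]
```

Under these assumptions, only the step at time 0 offers two ready threads; the two periods next coincide after about 43.5 s of simulated time.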
In this paper, we break the simulation-cycle barrier and let data-independent threads run out-of-order and in parallel. By carefully analyzing potential data dependencies and coordinating local time stamps for each thread, we fully maintain accuracy in simulation semantics and time. Fig. 2(b) shows this idea for the DVD player example. The MP3 and H.264 decoders run in parallel and maintain their own time stamps. As a result, we significantly reduce the simulator run time.
III. OUT-OF-ORDER PARALLEL SIMULATION
Regular DE simulation imposes a total order on event processing and time advancing, reducing the potential for parallel execution. We now propose a new out-of-order simulation scheme where timing is only partially ordered. We localize the global simulation time (time, delta) for each thread and allow threads without potential data or event conflicts to run ahead of time while other working threads are still running with earlier timestamps. To avoid any read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards on shared variables, we use static analysis to detect potentially conflicting code segments.
A. Definitions
To formally describe our out-of-order PDES algorithm, we introduce the following notations:
1) We define simulation time as a tuple $(t, \delta)$, where $t$ = time and $\delta$ = delta-cycle, and order time stamps as follows:
• equal: $(t_1, \delta_1) = (t_2, \delta_2)$ iff $t_1 = t_2$ and $\delta_1 = \delta_2$
• before: $(t_1, \delta_1) < (t_2, \delta_2)$ iff $t_1 < t_2$, or $t_1 = t_2$ and $\delta_1 < \delta_2$
2) Each thread $th$ has its own time $(t_{th}, \delta_{th})$.
3) Since events can be notified multiple times and at different simulation times, we note an event $e$ notified at $(t, \delta)$ as tuple $(id_e, t_e, \delta_e)$ and define: $\text{EVENTS} = \bigcup \text{EVENTS}_{t,\delta}$, where $\text{EVENTS}_{t,\delta} = \{(id_e, t_e, \delta_e) \mid t_e = t, \delta_e = \delta\}$.
4) For regular DE simulation, typically several sets of queued threads are defined, such as QUEUES = {READY, RUN, WAIT, WAITFOR}. These sets exist at all times and threads move from one to the other during simulation, as shown in Fig. 3(a). Now, for our out-of-order PDES, we define multiple sets with different time stamps, which we dynamically create and delete as needed, as illustrated in Fig. 3(b). Specifically, we define:
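• $\text{QUEUES} = \{\text{READY}, \text{RUN}, \text{WAIT}, \text{WAITFOR}, \text{JOINING}, \text{COMPLETE}\}$
• $\text{READY} = \bigcup \text{READY}_{t,\delta}$, $\text{READY}_{t,\delta} = \{th \mid th \text{ is ready to run at } (t, \delta)\}$
• $\text{RUN} = \bigcup \text{RUN}_{t,\delta}$, $\text{RUN}_{t,\delta} = \{th \mid th \text{ is running at } (t, \delta)\}$
• $\text{WAIT} = \bigcup \text{WAIT}_{t,\delta}$, $\text{WAIT}_{t,\delta} = \{th \mid th \text{ is waiting since } (t, \delta) \text{ for events } (id_e, t_e, \delta_e), \text{ where } (t_e, \delta_e) \geq (t, \delta)\}$
• $\text{WAITFOR} = \bigcup \text{WAITFOR}_{t,\delta}$, $\text{WAITFOR}_{t,\delta} = \{th \mid th \text{ is waiting for simulation time advance to } (t, 0)\}$
• $\text{JOINING} = \bigcup \text{JOINING}_{t,\delta}$, $\text{JOINING}_{t,\delta} = \{th \mid th \text{ created child threads at } (t, \delta) \text{ and waits for them to complete}\}$
• $\text{COMPLETE} = \bigcup \text{COMPLETE}_{t,\delta}$, $\text{COMPLETE}_{t,\delta} = \{th \mid th \text{ completed its execution at } (t, \delta)\}$
Note that our implementation orders these sets by increasing time stamps for efficiency.
5) Initial state at the beginning of simulation: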
Let THREADS be the set of all existing threads. Then, at any time, the following conditions hold:
At any time, each thread belongs to exactly one set, and this set determines its state. Determined by the scheduler, threads change state by transitioning between the sets, as follows:
The thread and event sets evolve during simulation as illustrated in Fig. 3. Whenever the sets $\text{READY}_{t,\delta}$ and $\text{RUN}_{t,\delta}$ are empty and there are no WAIT or WAITFOR queues with earlier timestamps, the scheduler deletes $\text{READY}_{t,\delta}$ and $\text{RUN}_{t,\delta}$, as well as any events with the same timestamp, $\text{EVENTS}_{t,\delta}$.
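The definitions above can be made concrete with a small data-structure sketch in Python. It is illustrative only: the class `TimedQueues` and its methods `move`, `state_of`, and `cleanup` are hypothetical names, threads are represented by plain strings, and the time values are arbitrary. The sketch shows the lexicographic order on $(t, \delta)$, one thread set per queue name and time stamp, and the deletion rule for empty READY and RUN sets.

```python
from collections import defaultdict


def before(ts1, ts2):
    """(t1, d1) < (t2, d2) iff t1 < t2, or t1 == t2 and d1 < d2."""
    return ts1 < ts2  # Python compares tuples lexicographically


class TimedQueues:
    """Hypothetical container: one thread set per queue name and time stamp."""

    NAMES = ("READY", "RUN", "WAIT", "WAITFOR", "JOINING", "COMPLETE")

    def __init__(self):
        self.sets = {name: defaultdict(set) for name in self.NAMES}
        self.events = defaultdict(set)  # event ids notified per time stamp

    def move(self, th, src, dst):
        """Move thread th; src and dst are (name, timestamp), src may be None."""
        if src is not None:
            self.sets[src[0]][src[1]].discard(th)
        self.sets[dst[0]][dst[1]].add(th)

    def state_of(self, th):
        """All (name, timestamp) sets holding th; the invariant is exactly one."""
        return [(name, ts) for name in self.NAMES
                for ts, members in self.sets[name].items() if th in members]

    def cleanup(self, ts):
        """Delete READY/RUN/EVENTS at ts if READY and RUN are empty there
        and no non-empty WAIT or WAITFOR set has an earlier time stamp."""
        if self.sets["READY"].get(ts) or self.sets["RUN"].get(ts):
            return False
        for name in ("WAIT", "WAITFOR"):
            for other, members in self.sets[name].items():
                if members and before(other, ts):
                    return False
        self.sets["READY"].pop(ts, None)
        self.sets["RUN"].pop(ts, None)
        self.events.pop(ts, None)
        return True


q = TimedQueues()
q.move("video", None, ("RUN", (0, 0)))
q.move("audio", None, ("RUN", (0, 0)))
q.move("audio", ("RUN", (0, 0)), ("WAITFOR", (26, 0)))
blocked = q.cleanup((0, 0))   # False: "video" is still running at (0, 0)
q.move("video", ("RUN", (0, 0)), ("WAITFOR", (33, 0)))
removed = q.cleanup((0, 0))   # True: nothing is left at or before (0, 0)
```

Keeping one set per time stamp is what lets threads with different local times coexist in the same logical queue.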
B. Out-of-order Scheduling Algorithm
Algorithm 1 shows the scheduling algorithm of our out-of-order parallel DE simulator. At each scheduling step, the scheduler first evaluates notified events and wakes up corresponding threads in WAIT. If a thread becomes ready to run, its local time advances to $(t_e, \delta_e + 1)$, where $(t_e, \delta_e)$ is the timestamp of the notified event (line 5 in Algorithm 1). After event handling, the scheduler cleans up any empty queues and expired events and issues qualified threads for the next delta-cycle (line 18). Next, any threads in WAITFOR are moved to the READY queue corresponding to their waiting time and issued for execution if qualified (line 28). Finally, if no thread can run ($\text{RUN} = \emptyset$), the simulator reports a deadlock and quits¹. Note that our scheduling is aggressive. The scheduler issues threads for execution as long as idle CPU cores and threads without any conflicts (HasNoConflicts(th)) are available.
Note also that we can easily turn on/off the parallel out-of-order execution at any time by setting the numCPUs variable. For example, when in-order execution is needed during debugging, we set numCPUs = 1 and the algorithm will behave the same as the traditional DE simulator where only one thread is running at any time.
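The issuing policy described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 1: the helper names `wake_time` and `schedule_step` are hypothetical, threads are plain strings, and conflicts are given as an explicit set of thread pairs instead of the table-based check introduced later.

```python
def wake_time(event_ts):
    """Local time of a thread woken by an event notified at (t_e, delta_e)."""
    t_e, delta_e = event_ts
    return (t_e, delta_e + 1)


def schedule_step(ready, running, num_cpus, conflicts):
    """Issue ready threads while cores are idle and no conflict is found.

    ready, running: lists of ((t, delta), thread_name) pairs.
    conflicts:      set of frozenset({a, b}) pairs that must stay in order.
    Returns the newly issued entries; ready and running are updated in place.
    """
    issued = []
    for ts, th in sorted(ready):              # earliest local time first
        if len(running) >= num_cpus:          # no idle core left
            break
        # Threads to respect: all running ones and ready ones that are earlier.
        others = [o for _, o in running] + [o for ots, o in ready if ots < ts]
        if all(frozenset((th, o)) not in conflicts for o in others):
            ready.remove((ts, th))
            running.append((ts, th))
            issued.append((ts, th))
    return issued


ready = [((33, 0), "video"), ((26, 0), "audio"), ((40, 0), "logger")]
running = []
conflicts = {frozenset(("audio", "logger"))}  # assumed conflict for the demo
issued = schedule_step(ready, running, num_cpus=4, conflicts=conflicts)
# issued == [((26, 0), "audio"), ((33, 0), "video")]; "logger" stays in ready
```

With `num_cpus=1`, the same call issues at most one thread, which mirrors the numCPUs = 1 debugging setting mentioned above.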
C. Static Conflict Analysis at Compile-Time
We use static analysis of the application code to determine whether or not a thread is qualified to run early/out-of-order. In particular, we have to prevent conflicting parallel accesses to shared variables, namely read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). Fig. 4 shows a simple example of a WAW conflict where two threads $th_1$ and $th_2$ write to the same variable i at different times. Simulation semantics require that $th_1$ executes first and sets i to 0 at time (5, 0), followed by $th_2$ setting i to its final value 1 at time (10, 0). Now, if our simulator issued the threads $th_1$ and $th_2$ out-of-order, we would create a race condition, making the final value of i non-deterministic. Thus, we must not schedule $th_1$ and $th_2$ out-of-order. Note, however, that [...]
¹ The condition for a deadlock is the same as for a regular DE simulator.
• Segment $s_i$: code portion executed by a thread between two scheduling steps.
• Segment Boundary $v_i$: SLDL statements which call the scheduler, i.e., wait, waitfor, par.
Note that segments $s_i$ and segment boundaries $v_i$ form a directed graph. $s_i$ is the segment followed by segment boundary $v_i$. $v_i$ can be followed by multiple segment boundaries, and $s_i$ can be composed of multiple code portions.
• Segment Graph (SG): SG = (V, E), where $V = \{v \mid v \text{ is a segment boundary}\}$ and $E = \{e_{ij} \mid e_{ij} \text{ is the code portion between } v_i \text{ and } v_j, \text{ where } v_j \text{ is reached after } v_i\}$.
• Segment Conflict [...]
[...] waitfor) into nodes and all possible flows of control into edges. Here, segment node 3 corresponds to the wait e2 statement. From there, control reaches either node 4 (wait e3) through blocks e, g, h or node 5 (wait e4) through blocks e, g, j.
For the general case, our compiler uses Algorithm 2 to traverse an application's CFG following all branches, function calls and threads, and recursively build the corresponding SG.
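The idea of collapsing a CFG into a segment graph can be sketched in Python. This is not Algorithm 2, which also follows function calls and threads and collects access lists. It is a minimal worklist traversal under the assumption that the CFG is given as a successor map and that the scheduler-calling statements are known. The node names below only echo the example in the text (wait e2 reaching wait e3 via blocks e, g, h and wait e4 via blocks e, g, j) and are otherwise hypothetical.

```python
def build_segment_graph(cfg, boundaries, start):
    """Collapse a control-flow graph into a segment graph.

    cfg:        dict mapping each CFG node to a list of successor nodes
    boundaries: set of CFG nodes that call the scheduler (wait, waitfor, par)
    start:      entry node, treated as the initial segment boundary
    Returns a dict mapping each reachable boundary to the set of boundaries
    reachable from it without passing through another boundary.
    """
    sg = {}
    worklist = [start]
    while worklist:
        v = worklist.pop()
        if v in sg:
            continue
        sg[v] = set()
        stack, seen = list(cfg.get(v, ())), set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if n in boundaries:
                sg[v].add(n)           # edge v -> n in the segment graph
                worklist.append(n)     # continue building from n later
            else:
                stack.extend(cfg.get(n, ()))
    return sg


cfg = {
    "wait_e2": ["e"], "e": ["g"], "g": ["h", "j"],
    "h": ["wait_e3"], "j": ["wait_e4"],
    "wait_e3": [], "wait_e4": [],
}
boundaries = {"wait_e2", "wait_e3", "wait_e4"}
sg = build_segment_graph(cfg, boundaries, "wait_e2")
# sg["wait_e2"] == {"wait_e3", "wait_e4"}
```

The `seen` set keeps the traversal finite when the code between two boundaries contains loops.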
2) Computing the Segment Conflict Table: Based on the SG, we can easily compute a table of conflicting segments.
First, we compile for each segment a variable access list which contains all variables accessed in the segment. Each entry is a tuple (Symbol, AccessType) where Symbol is the variable and AccessType specifies read-only (R), write-only (W), read-write (RW), or pointer access (Ptr).
For example, a statement a = a + b creates an access list {a(RW), b(R)}.
Our compiler computes the variable access lists for each segment during the generation of the SG (line 43 in Algorithm 2, ExtendAccess()). Note that we currently do not perform any pointer analysis (future work). Instead, we conservatively mark all segments with pointer accesses (Ptr) as conflicting. However, we do follow port mappings through the structural hierarchy of the design model and store the actual target variables in the access list.
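A minimal Python sketch of how access lists can be turned into a conflict table follows. It assumes that each access list is a mapping from variable name to one of the access types R, W, RW, or Ptr, and that two segments conflict when at least one of them writes a variable they share, or when either contains a pointer access. The function names and the four example segments are hypothetical.

```python
def segments_conflict(access_a, access_b):
    """True if two segments may race on a shared variable.

    Each access list maps a variable name to "R", "W", "RW", or "Ptr".
    """
    # Conservative rule from the text: any pointer access marks a conflict.
    if "Ptr" in access_a.values() or "Ptr" in access_b.values():
        return True
    for var in access_a.keys() & access_b.keys():
        # Two reads never conflict; a write on either side does
        # (RAW, WAR, or WAW, depending on the execution order).
        if "W" in access_a[var] or "W" in access_b[var]:
            return True
    return False


def conflict_table(access_lists):
    """Square boolean table; entry [i][j] tells whether segments i, j conflict."""
    n = len(access_lists)
    return [[segments_conflict(access_lists[i], access_lists[j])
             for j in range(n)] for i in range(n)]


access_lists = [
    {"a": "RW", "b": "R"},   # segment 0: a = a + b
    {"b": "R", "c": "W"},    # segment 1: c = b
    {"a": "R"},              # segment 2: reads a only
    {"p": "Ptr"},            # segment 3: pointer access, always conflicting
]
table = conflict_table(access_lists)
# table[0][1] is False (b is only read), table[0][2] is True (a: RW vs. R)
```

Because the relation is symmetric, an implementation could store only half of the table; the full square form is kept here for simple lookups.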
Finally, we create the segment conflict 
3) Scheduling Conflict Detection:
While the segment graph and conflict table are built at compile time, the simulator needs to check at run-time whether an available thread at a particular segment can be issued out-of-order, i.e., without conflict. To do this efficiently, we use a table-lookup in CTab[i, j] and only run our out-of-order scheduler when a CPU core is idle.
In order to provide the scheduler with the next segment a given thread is about to execute, our compiler instruments the SLDL code such that the segment ID is passed to the scheduler as an additional argument when the thread executes a wait, waitfor, or other scheduling statement. At run-time, the scheduler then calls the HasNoConflicts(th) function to determine whether or not to issue the thread th early. As shown in Algorithm 3, the HasNoConflicts(th) function checks for potential conflicts with all parallel running threads (in RUN), as well as all waiting threads in the READY and WAIT queues with an earlier time stamp than th. Note that each check can be performed in constant time (O(1)) due to the table-lookup in function Conflict(th1, th2). 
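The check described above can be sketched in Python as follows. This is an illustration of the described behavior and not a transcription of Algorithm 3: threads are represented as dictionaries with the hypothetical keys `seg` (ID of the segment to execute next) and `ts` (local time stamp), and `ctab` is a small assumed conflict table.

```python
def has_no_conflicts(th, run, ready, wait, ctab):
    """Return True if thread th may be issued out-of-order.

    th and all queue members are dicts with keys "seg" and "ts";
    ctab[i][j] is True if segments i and j conflict.
    """
    def conflict(a, b):
        return ctab[a["seg"]][b["seg"]]      # constant-time table lookup

    # Conflicts with any thread that is currently running.
    for other in run:
        if other is not th and conflict(th, other):
            return False
    # Conflicts with ready or waiting threads that have an earlier time stamp.
    for other in ready + wait:
        if other is not th and other["ts"] < th["ts"] and conflict(th, other):
            return False
    return True


ctab = [
    [True,  False, True],
    [False, True,  False],
    [True,  False, False],
]
audio = {"seg": 0, "ts": (26, 0)}
video = {"seg": 1, "ts": (33, 0)}
monitor = {"seg": 2, "ts": (20, 0)}

video_ok = has_no_conflicts(video, [audio], [video], [monitor], ctab)
audio_ok = has_no_conflicts(audio, [], [audio], [monitor], ctab)
# video_ok is True; audio_ok is False because segment 0 conflicts with the
# earlier waiting thread at segment 2.
```

Each pairwise check is a single table lookup, so the cost of one call grows only with the number of threads inspected.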
IV. EXPERIMENTS AND RESULTS
We have implemented the proposed out-of-order parallel simulator in a SpecC²-based system design environment [5] and conducted experiments on three multi-media applications shown in Fig. 6. To demonstrate the benefits of our out-of-order PDES, we compare the compiler and simulator run times with the traditional single-threaded reference and a regular parallel implementation [2] without out-of-order scheduling. All experiments have been performed on the same host PC with a 4-core CPU (Intel(R) Core(TM) 2 Quad) at 3.0 GHz.
A. An Abstract Model of a DVD Player
Our first experiment uses the DVD player model shown in Fig. 6(a). Similar to the model discussed in Section II, an H.264 video and an MP3 audio stream are decoded in parallel. However, this model features four parallel slice decoders which decode separate slices in an H.264 frame simultaneously. Specifically, the H.264 stimulus reads new frames from the input stream and dispatches its slices to the four slice decoders. A synchronizer block completes the decoding of each frame and triggers the stimulus to send the next one. The blocks in the model communicate via double-handshake channels. According to profiling results, the workload ratio between decoding one H.264 frame with 704x566 pixels and one 44.1kHz MP3 frame is about 30:1. Further, about 70% of the decoding time is spent in the slice decoders. The resulting workload of the major blocks is shown in the diagram.
Table I shows the statistics and measurements for this model. Note that the conflict table is very sparse, allowing 78.47% of the threads to be issued out-of-order. While the regular PDES loses performance due to in-order time barriers and synchronization overheads, our out-of-order simulator shows twice the simulation speed.
² Due to its similarity, our results are equally applicable to SystemC [4].
B. A JPEG Encoder Model
Our second experiment uses the JPEG image encoder model shown in Fig. 6(b) . The stimulus reads a BMP color image with 3216x2136 pixels and performs color-space conversion from RGB to YCbCr. Since encoding of the three color components (Y, Cb, Cr) is independent, our JPEG encoder performs the DCT, quantization and zigzag modules for the colors in parallel, followed by a sequential Huffman encoder at the end. The JPEG monitor collects the encoded data and stores it in the output file.
To show the increased simulation speed also for models at different abstraction levels, we have created four models (spec, arch, sched, net) with an increasing amount of implementation detail, down to a network model with detailed bus transactions. Table II lists the PDES statistics and shows that, for the JPEG encoder, about half or more of all threads can be issued out-of-order. Table III shows the corresponding compiler and simulator run times. While the compile time increases similarly to that of the regular parallel compiler, the simulation speed improves by about 138%, more than 5 times the gain of the regular parallel simulator.
[...] traditional single-threaded reference implementation, as well as a regular multi-core parallel simulator.
Our out-of-order PDES technique fully maintains SLDL simulation semantics and is applicable, without loss of accuracy, to C-based system-level models at any abstraction level.
In future work, we will optimize the static code analysis and look into additional methods to further improve the simulation speed.
