Because it requires no functional simulation, trace-driven simulation has the potential to achieve faster simulation speeds than execution-driven simulation of multicore architectures. An efficient, on-the-fly, high-fidelity trace generation method for multithreaded applications is reported. The generated trace is encoded in an instruction-like binary format that can be directly "interpreted" by a timing simulator to simulate a general load/store or x86-like architecture. A complete tool suite was developed and used to evaluate the proposed method; the evaluation showed that it produces smaller traces than existing trace compression methods while retaining good fidelity, including all threading- and synchronization-related events.
INTRODUCTION
Exploring different many-core architectures through simulation is a major challenge for both architects and application developers, owing to the complexity of these architectures with their many cores, interconnection networks, and complex cache hierarchies. Many approaches have been proposed to simulate such complex architectures with adequate speed and fidelity. The most widely used approach is to parallelize software simulations of multicores using multiple simulation threads that run on multicore machines (Argollo et al. 2009; Jun et al. 2014; Miller et al. 2010; Pengju et al. 2012). Another approach for accelerating many-core simulations is to use Field-Programmable Gate Arrays (FPGAs) (Tan et al. 2010). FPGAs offer massive
Section 2 presents a survey of related work. Section 3 details the basic strategy and methods developed to generate CET code, including a description of the chain of phases the input application undergoes to generate the compact trace. Experimental results using the developed tool suite are presented in Section 4. Finally, conclusions and future work are presented in Section 5.
RELATED WORK
An address trace compression technique based on loop detection was proposed in Elnozahy (1999). Control-flow analysis (scanning the trace and finding repeated patterns) is used to detect loops that have constant or varying-by-constant addresses. Complex situations in which loops have function calls and/or complex structures were not handled. In Milenkovic and Milenkovic (2003), a stream-based compression (SBC) was proposed. Stream blocks (instructions between branches) are replaced by assigned indices, exploiting locality. However, when the number of stream blocks grows, the compression ratio decreases. Later on, the authors further improved their SBC method and utilized variable field-length encoding/decoding (Milenkovic and Milenkovic 2007). SBC then achieved better compression ratios than PDATS (Johnson et al. 2001), LBTC (Luo and John 2004), and TCgen (Burtscher et al. 2005). To achieve more compression, the instruction and data addresses of the original trace, a Dinero trace (Edler and Hill 1998), were separated into two files, and the 2-bit headers that indicated the type of address were removed. This boosted the compression ratio since the headers were noncompressible, but significant details were lost, which limits the use of such traces in simulations. In Chen et al. (2013), a lossless trace compression technique that exploits spatial and temporal locality was proposed. The technique is limited to instructions and their addresses only; that is, data addresses are not covered. Instruction addresses were classified into two categories: (1) sequential addresses with a constant difference between any two consecutive addresses and (2) nonsequential addresses with variable address strides. The input trace consists of pairs of numbers: the instruction address and the instruction itself. The output consists of three components: the static program, the sequential address file, and the nonsequential addresses.
The second file consists of a very long bit vector in which each bit corresponds to a trace element and indicates whether its address is sequential. The authors compared their compression rate to those of PDATS, PDI (Johnson et al. 2001), LBTC (Luo and John 2004), and SBC (Milenkovic and Milenkovic 2003); their technique outperformed all of them. Budanur et al. (2012) proposed a memory trace compression technique for SPMD (single program, multiple data) applications. It is based on PRSD (power regular section descriptor) abstractions (Marathe et al. 2003; Noeth et al. 2009), but it is finer grained; they called it EPRSD (extended PRSD). A PIN-based instrumentation tool (memtrace) takes an SPMD application as input and generates its memory trace on the fly. The generated trace is compressed using EPRSD. The memtrace tool runs as a set of MPI processes. Each process instruments an SPMD program and outputs the trace into a pipe, from which the trace compressor consumes it. The compressor performs intrathread compression on the fly by exploiting repetitive patterns. After instrumentation terminates, it performs interthread compression by factoring out the common parts among threads and finally performs interprocess merging among all processes of the SPMD application. It reduced the trace size by half for the AMG benchmark. Unlike our technique, this technique requires a decompression phase. Janapsatya et al. (2007) proposed a trace compression technique for instruction addresses alongside an instruction cache analysis method. Their main objective was not to maximize the compression ratio but to accelerate trace processing. This technique is limited to instruction addresses only. It achieved a simulation speedup of up to 9.67 over existing techniques, but the compression ratio was 2 to 10 times worse than Gzip. In Burtscher et al. (2005), four VPC (value prediction-based compression) algorithms were introduced, namely, VPC1, VPC2, VPC3, and VPC4. The input trace consists of pairs of numbers; the first is a 32-bit PC, and the second is a 64-bit extended data (ED) value. VPC algorithms use predictors to predict the next value based on the previously observed values. If the next value is predicted correctly, the index of the predictor that predicted it is retained. The unpredicted values are written to a different stream. If more than one predictor predicts a certain value, heuristics select the best one. For example, VPC1 uses Huffman encoding: if more than one predictor is correct, the one with the shortest Huffman code is selected. Because the number of predictors is small, the number of bits needed to encode a predictor's index is smaller than the corresponding trace element, and the trace is therefore compressed. The same algorithm is applied in reverse to decompress the compressed trace. TCgen (Burtscher et al. 2005) is a VPC tool that autogenerates a value-prediction-based trace compressor based on user specifications. In Asanovic et al. (2005), a technique for directory and cache state reconstruction (similar to warming up) to accelerate sampled multiprocessor simulation is proposed. A software structure called MTR (Memory Timestamp Record), which can be updated during fast-forwarding (a functional simulator updates the architectural state in between sampling points), is used. For each memory block (cache block), there is an MTR record that registers the ID of the last processor that modified this block, the timestamp of the last write operation, and an array of timestamps of the read operations on the block (one timestamp per processor). During fast-forwarding, a read/write operation updates the MTR record. The directory and cache state reconstruction occurs right before each sampling point.
This is done in two steps: (1) determining the subset of blocks that are still cached and (2) checking cross-processor interactions to determine which of these blocks should be valid or dirty according to the cache coherence protocol. This technique, however, works for sampled execution-driven simulations but not for trace-driven simulations. In Barr and Asanovic (2006), a technique to compress branch trace information to be used in snapshot-based microarchitecture simulation is introduced. The compressed trace can be used to warm any arbitrary branch predictor's state before timing simulation of the snapshot. However, this technique is specific to branch information. A trace compression algorithm based on simple loop detection using data addresses was proposed in Ketterlin and Clauss (2008). A data address trace is scanned to detect loop nests using the linear progressions of the addresses. The output is a sequence of loop nests. This algorithm can handle simple loops only and is limited to data addresses. In Zhang and Gupta (2005), a unified trace representation dubbed whole execution traces (WETs) is proposed. A WET is constructed by labeling a static program representation with profile information, which is then compressed in a two-step methodology that allows efficient traversal of the generated trace to extract information corresponding to individual profile types. The achieved compression rate was between 16 and 83. In Nilakantan et al. (2015), the Valgrind dynamic binary instrumentation tool is used to generate traces for multithreaded applications that abstract the original instructions into three types of "events": computation (work performed local to a thread), communication (read/write dependencies between threads), and synchronization (embedded pthread calls for each thread). A separate trace is generated for each thread.
The computation events contain counts of integer/floating-point operations, memory writes/reads, and a list of their unique virtual memory addresses. No control flow events are captured. Butko et al. (2015) proposed using trace-driven simulations for multicores. Traces (one file per core) are collected from GEM5 multicore system simulations by capturing and storing all requests (core side) and responses (memory side) at the interface of each core's private cache memory. Again, no control flow is captured in the traces, and only trace information related to cache miss events is kept, as only those result in external memory traffic (the objective of their trace collection). Rico et al. (2011) also proposed a trace-driven methodology for multicore simulations. They capture full raw traces (no compression or compaction) of sequential sections of tasks or threads, together with calls to parallelization events (task/thread spawning). Other trace generation techniques such as Bodik et al. (2006), Bodik et al. (2003), and PinPlay (Patil et al. 2010) are concerned with deterministic replay of the program by recording a fixed execution path for the nondeterministic events (e.g., thread interleaving and memory operation order). This deterministic replay is useful for software debugging and computer architecture simulation. However, since replay implies real execution of the program, these techniques do not work for trace-driven simulators because real execution requires functional units that are missing in such simulators.
CET GENERATION
The trace generation method proposed in this work translates a multithreaded input application's executable into another binary format called CET code. The latter encodes the original application's static code and the data required for timing simulation in a compressed format. The data that cannot be compressed (i.e., embedded into the CET code) is kept aside and is called CET data. Each thread of the application is thus translated into five files, namely, the CET code, branch results, jump displacements, loop counters (in the case of inner loops whose counters do not follow a certain pattern), and data addresses (for nonuniform data referencing). Interthread compression was not pursued in the current work in order to maintain the simplicity of the compressed trace (and hence eliminate decompression). The generated compact trace is intended only for simulation and not for debugging. It can be used by architectural/timing simulators to perform deterministic replay of the same trace on different architectures. It captures multithreading synchronization events with CET-defined primitives to create, pause, resume, and terminate threads. These primitives are used to encode synchronization barriers, access to critical sections (locks/unlocks), and atomic read-modify-write operations in the generated CET trace. The resulting CET code size is still less than the application's executable size. The CET data file size varies depending on the application. CET code and data are generated once for a specific input program and dataset but can be used to simulate many architectural configurations (general LOAD/STORE and x86-like core architectures). The CET code needs to be regenerated only when the number of threads changes. No timing information is recorded in the CET code, as it would differ between architectural configurations. For exact replay of multithreaded traces, PinPlay (Patil et al. 2010) can be used.
Basic Strategy
The basic strategy in the proposed trace generation and compaction methodology is to remove all possible redundancies, in both instructions and data memory references, from the execution trace while preserving a record of as many of the execution events as possible (i.e., fidelity) and maintaining simplicity for the generated compact trace. A static executable code (CET code) is generated along with the minimum execution data (CET data) from which most of the execution events can be reconstructed by a simulator. Several strategies have been employed to achieve these goals:
(1) Loop detection: This is the main strategy to produce static CET code that resembles the original application's executable code. The developed algorithm does that very well for all types of loops (simple/nested with constant/regular/variable counters).
(2) Data/instruction addresses compression: Loop detection is just the first step; if all data/instruction references in the loop body are kept as is (for fidelity), then there will be no compression. Therefore, our strategy was to detect various types of data/instruction references and keep only the minimum information required to recalculate them during simulations. Hence, constant references or references with constant strides are encoded in the CET code itself. Furthermore, all the irregular references stored in the CET code and/or data are relative to previous references (i.e., only the difference from the previous reference is stored). CET data is generated as a FIFO, in the same order as the CET instructions that utilize it. Hence, the original trace can be readily reconstructed and simulated from both the CET code and CET data using a thread's initial instruction address.
(3) Use minimum field sizes: All fields in the CET code and data (except for CET op-codes) are under the direct control of the user and can be set to the minimum required.
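The reference classification in strategy (2) can be illustrated with a minimal Python sketch. It is not the CET tool's implementation: the function names, the category labels, and the choice to store the first address of an irregular sequence explicitly are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def classify_references(addresses: List[int]) -> Tuple[str, Dict]:
    """Classify the addresses touched by one static instruction.

    Labels ("constant", "strided", "irregular") are hypothetical names,
    not CET opcodes.
    """
    if not addresses:
        raise ValueError("need at least one address")
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if all(d == 0 for d in deltas):
        # Same address every time: can be embedded in the code itself.
        return "constant", {"address": addresses[0]}
    if len(set(deltas)) == 1:
        # Constant stride: base and stride can be embedded in the code.
        return "strided", {"base": addresses[0], "stride": deltas[0]}
    # Irregular: keep only differences from the previous reference.
    return "irregular", {"first": addresses[0], "deltas": deltas}


def rebuild(kind: str, info: Dict, count: int) -> List[int]:
    """Recompute the address sequence from the compact description."""
    if kind == "constant":
        return [info["address"]] * count
    if kind == "strided":
        return [info["base"] + i * info["stride"] for i in range(count)]
    out = [info["first"]]
    for d in info["deltas"]:
        out.append(out[-1] + d)
    return out
```

For example, `classify_references([100, 104, 108])` yields a strided description, and `rebuild` recovers the original three addresses from the base and stride alone.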
The generated CET code and data have the following features:
(1) The original program's execution order (control flow) is preserved without keeping any instruction addresses except the initial thread address. The CET framework supports SIMD Load/Store instructions. They appear in the instrumented trace as loads/stores with contiguous data addresses. This allows the CET trace to be used in simulating architectures that do not support SIMD: the simulator can treat the load/store instructions either as separate instructions with contiguous data addresses or as SIMD instructions. SIMD ALU instructions are currently treated as regular ALU instructions. It should also be noted that memory fencing instructions (which enforce a certain memory load/store order) do not affect the execution trace, which is always generated in order. Memory fencing instructions can be inserted in the generated trace when it is used in an architectural simulator that supports out-of-order execution.
(2) The CET code and data format simplifies their generation and usage in simulations, specifically:
• All the fields in the CET instructions have fixed lengths. This is crucial for FPGA-based simulators where the CET trace would be stored in an on-chip or on-board memory. Having fixed-width fields simplifies the code storage, fetching, and decoding.
• CET instructions that require no data or utilize CET data consist of the opcode field only.
• For other instructions, the required data is embedded in the CET instruction itself (e.g., the address of a scalar LOAD/STORE instruction, the base address and the stride value in the contiguous LOAD-C/STORE instructions, the LOOP counter, etc.).
(3) The CET data files (for each thread) are:
a. CET Addresses: These are the noncontiguous data load/store addresses. Only the difference from the previous address is encoded in the CET data, not the complete address, reducing their total size by at least 50%. The user specifies the size of this field (default is 16 bits).
b. Branch Results: These are the results of conditional branch instructions (taken or not taken), when they are executed multiple times in a loop body (but not to implement the looping itself). The size of this field is 1 bit.
c. Jump-M: Dynamic target addresses of unconditional jump, call, and return instructions, when the jump is to multiple locations. Only the displacement (in number of CET instructions) between the current instruction and the target instruction is stored. The user specifies the size of this field (default is 16 bits).
d. Loop-R: Irregular inner-loop counters (number of iterations). This is for inner loops that have a different number of iterations per outer loop iteration, where these counters do not follow a certain pattern. Again, the user specifies the size of this field (default is 32 bits).
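To make the field sizes above concrete, the following Python sketch packs address differences into 16-bit fields and branch results into single bits. The byte order, the use of signed differences, the overflow behavior, and the function names are assumptions; the text does not specify the on-disk layout of the CET data files.

```python
import struct
from typing import Iterable, List


def pack_address_deltas(addresses: Iterable[int], previous: int) -> bytes:
    """Store each address as a signed 16-bit difference from the prior one.

    Assumption: little-endian signed fields; a difference that does not
    fit raises an error (how the real tool handles this is not stated).
    """
    out = bytearray()
    for address in addresses:
        delta = address - previous
        if not -32768 <= delta <= 32767:
            raise OverflowError(f"delta {delta} does not fit in 16 bits")
        out += struct.pack("<h", delta)
        previous = address
    return bytes(out)


def pack_branch_results(results: List[bool]) -> bytes:
    """Store one bit per conditional-branch outcome (1 = taken)."""
    out = bytearray((len(results) + 7) // 8)
    for i, taken in enumerate(results):
        if taken:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)
```

Under these assumptions, two 32-bit addresses (8 bytes) shrink to 4 bytes of differences, which matches the "at least 50%" reduction quoted for the default 16-bit field.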
A special tool has been developed to verify the effectiveness of the proposed CET code generation methodology. It can be integrated with the trace generator (i.e., the functional simulator) or the instrumentation tool. This allows compact trace generation to start on the fly (i.e., while the program is being executed or emulated), making our method extremely efficient in terms of time and memory requirements. Instructions and function calls in the original execution trace are classified into one of 18 unique categories that belong to six different classes. These 18 categories are agnostic to any LOAD/STORE or x86-like general-purpose core architecture. They were chosen to be the minimum set that can be used by a trace-driven simulator to reconstruct execution events. Figure 1 shows the three phases of CET generation. The input is an executable file of the multithreaded program, alongside its input data, that goes through a three-tool chain: the profiler, the code generator, and the emulator and CET data generator. Using PIN instrumentation, these phases are repeated for each thread in the input program. The final output of the tool comprises the CET code and data for each thread separately. Producing separate CET code and data for each thread allows simulators to process these threads in parallel. Moreover, the CET tool generates a log file of useful information for the user. It also generates the starting address of each thread, which represents the point in the code at which the thread has to be created. Intel's Pin framework (Intel 2012) has been used for instrumentation in the current version of the CET tool, which restricts it to user-mode code only. Other ISAs can be supported using instrumentation tools such as Valgrind (2000). A full-system functional simulator can also be used to generate system-level code traces.
It should be noted that events such as thread interleaving do not affect the capture of a thread's instructions, since each thread's trace is kept separate. For example, if a loop trace in a certain thread is interrupted due to thread interleaving, the trace is continued when the thread resumes, and the generated trace does not show the interleaving. Each category is assigned a unique CET code and has special arguments. Table 1 summarizes the different instruction classes, categories, their CET code format, and their corresponding CET data (if any). In addition to the CET code formats shown in Table 1, the CET code contains the register numbers of the corresponding original instructions, allowing the simulator to capture hazards in the CET code. For the System calls class (all other system calls not related to synchronization), the unique system call identifier/number is encoded in the CET code using 10 bits. This is sufficient for all existing operating systems, where the number of system calls does not exceed 500 ("Searchable Linux Syscall Table for x86 and x86_64"). The total number of CET instructions is 26, leaving six unused codes for new CET instructions if needed.
CET Profiler
In this phase, the application's profile is generated using functional simulation or native execution. Using instrumentation, information about executed instructions is collected: memory references accessed by loads/stores, branch results in the case of conditional branch, and so forth. The generated application's profile is an intermediate image of the application where each executed instruction is an object (organized as one object array per thread). Each object contains an opcode, a list of memory references, a list of loop counters, a list of conditional branch results (taken/not taken), and so forth. For each instruction type, the corresponding list is filled during profiling. The input application is profiled dynamically by instrumenting each instruction; when an instruction is encountered for the first time, a new object for this instruction is created and mapped to a unique location in the profiled image. If the same instruction address is encountered again later, its corresponding object is updated if required. Figure 2 shows the CET profiling algorithm.
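The profiled image described above can be pictured with the following Python sketch, in which each thread owns a map from instruction address to a record that is created on first encounter and updated afterward. The class and field names are hypothetical and only a subset of the per-instruction lists is modeled; the actual tool is built on Pin instrumentation, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InstrRecord:
    """One object per static instruction (hypothetical layout)."""
    opcode: str
    mem_refs: List[int] = field(default_factory=list)
    branch_results: List[bool] = field(default_factory=list)


class ThreadProfile:
    """Profiled image of one thread, keyed by instruction address."""

    def __init__(self) -> None:
        self.records: Dict[int, InstrRecord] = {}

    def observe(self, address: int, opcode: str,
                mem_ref: Optional[int] = None,
                taken: Optional[bool] = None) -> InstrRecord:
        record = self.records.get(address)
        if record is None:
            # First time this instruction address is seen: create its object.
            record = InstrRecord(opcode)
            self.records[address] = record
        # Later encounters only update the lists relevant to the instruction.
        if mem_ref is not None:
            record.mem_refs.append(mem_ref)
        if taken is not None:
            record.branch_results.append(taken)
        return record
```

Executing the same load three times therefore produces a single record holding three memory references rather than three trace entries.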
Loop Recognition Algorithm
The x86 architecture has multiple explicit loop instructions, namely, LOOP, LOOPE, LOOPNE, LOOPZ, and LOOPNZ. These instructions are directly detected by the CET profiler and encoded into CET loops. However, compilers often use conditional branch instructions to encode loops. Therefore, there is a need to distinguish between the conditional branch instructions that implement loops and other conditional branches. Loops represent the main avenue for execution trace compaction. Moreover, detecting the x86 conditional branches that implement loops and translating them into CET loops minimizes the size of the CET data significantly. For example, if all x86 conditional branches (CBs) are left as they are, then a loop of 1 million iterations will require 1 million bits to store its branch's results (taken or not taken). With loop detection, however, this branch instruction is translated into a single CET loop instruction whose number of iterations is embedded into its body.
27:10 A. Hroub et al.
Conditional branches implementing loops are distinguished from other CBs using a two-phase algorithm. The first phase identifies all repeated CBs as loop candidates. The second phase separates those repeated CBs that implement loops from regular CBs that are merely repeated a number of times. The latter CBs are then switched back to normal conditional branches.
In the first phase, the CET profiler creates a list for each encountered CB instruction (identified by its address) and stores the branch's results for that CB instruction. Similar consecutive branch results (taken/not taken) are stored in one list node, which has two fields: the result of the branch (T/NT) and a counter that represents how many consecutive times the branch was taken or not taken. As such, each loop candidate CB instruction will have a branch's results chain, as shown in Figure 3; for example, an inner loop with a constant count of 10 that is inside an outer loop of count 10 will have a chain of 10 T blocks, each having a count value of 10, and 10 NT blocks, each having a count of 1. Hence, an x86 conditional branch instruction is considered a loop candidate under the following conditions:
1. All not-taken nodes have a counter of one.
2. The last node must be a not-taken node.
3. The first node can be either taken or not taken depending on whether the loop is outer or inner. Thus, the loop instruction has a flag bit to indicate if the first node is taken or not.
The loop body is the set of instructions between the CB and the target address. Referring to Table 1, repeated loops (i.e., inner loops) will be encoded either as LOOP (with constant counter), LOOP-C (with a counter that increments/decrements by a constant amount), or LOOP-R (with random counters). This algorithm will detect all properly nested loops with forward or backward edges. It may also consider some conditional branches as loops that were not intended to be loops; for example, if a conditional branch is executed twice, being taken the first time and not taken the second time, it will be considered as a loop with one iteration. This behavior, however, is still correct. In addition, in case some improperly nested loops are not recognized as loops, they will be regular CBs, but the contiguous data addresses and instruction addresses within these loops will still be compressed. The only overhead is the inclusion of their associated T/NT bits in the CET data, which is not very significant.
of L. Otherwise, if it is at the front of L, its counter is decremented (and popped out when its counter reaches zero). If the loop candidate is found in L but not at the front, this means it was just a CB that does not represent a loop, so it is removed from L and reverted to a regular CB CET instruction.
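The first-phase candidate test can be sketched in Python as follows. The sketch covers only the run-length chain, the first two conditions (the third only sets a flag), and the LOOP/LOOP-C/LOOP-R distinction; it does not attempt the second-phase filtering. Function names are hypothetical, and the rule used to tell LOOP-C from LOOP-R (a constant difference between successive taken counts) is an interpretation of the text rather than the tool's exact criterion.

```python
from itertools import groupby
from typing import List, Tuple

Chain = List[Tuple[bool, int]]  # (taken?, consecutive count) per node


def branch_chain(results: List[bool]) -> Chain:
    """Run-length encode a conditional branch's taken/not-taken history."""
    return [(taken, sum(1 for _ in run)) for taken, run in groupby(results)]


def is_loop_candidate(chain: Chain) -> bool:
    """Conditions 1 and 2: every NT node counts one; the last node is NT."""
    if not chain or chain[-1][0]:
        return False
    return all(count == 1 for taken, count in chain if not taken)


def loop_kind(chain: Chain) -> str:
    """Pick LOOP, LOOP-C, or LOOP-R from the taken-node counters."""
    counts = [count for taken, count in chain if taken]
    if len(set(counts)) <= 1:
        return "LOOP"      # constant counter
    steps = {b - a for a, b in zip(counts, counts[1:])}
    if len(steps) == 1:
        return "LOOP-C"    # counter changes by a constant amount
    return "LOOP-R"        # no regular pattern
```

With the example from the text (ten runs of ten taken results, each followed by one not-taken result), `branch_chain` produces ten T nodes of count 10 and ten NT nodes of count 1, which passes the candidate test and is classified as LOOP.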
Thread Synchronization Events
Detecting thread synchronization events is straightforward with Pin instrumentation. Pin provides the RTN (routine) object, whose functions allow Pin to intercept different routines and function calls (including multithreading events such as locks, unlocks, barriers, etc.). Hence, routines are intercepted by their names. Furthermore, Pin provides callbacks when each thread starts and ends. These are used by our tool to create the CET code and data files for each thread. As shown in Table 1, only four CET instructions (START, PAUSE, WAKE, and TERMINATE) are used to encode thread synchronization events such as synchronization barriers, access to critical sections (locks/unlocks), and atomic (read-modify-write) operations. Figure 5 illustrates that a simulator can simulate different synchronization events encoded with these instructions only (with no timestamps in the CET trace). Thread creation is straightforward with the START instruction, as shown in Figure 5(a). Thread synchronization at barriers can be modeled in two ways:
1. One that allows replicating the original thread arrival order at the barrier; see Figure 5(b). A WAKE (W) instruction is inserted (at the point of arrival) in the latest thread that arrives at the barrier during trace generation. PAUSE (P) instructions are inserted at the arrival points of the other threads. The simulator in this case would pause all threads until all of them arrive at the barrier and then reset each thread's time counter to that of the thread with the W instruction.
2. Another that allows a different thread arrival order (e.g., when simulating a different HW configuration than the one used for CET generation); see Figure 5(c). Only P instructions are inserted at the arrival points of all threads. In this case, the simulator would pause each thread that reaches a P instruction, until all threads are paused, and then all threads are resumed from the time counter of the last thread that reached the P instruction.
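A simulator's handling of the two barrier models might look like the Python sketch below. It reads the descriptions literally and assumes that each thread carries a single scalar time counter; the function name and the dictionary-based interface are hypothetical.

```python
from typing import Dict, Optional


def release_barrier(arrival_times: Dict[int, int],
                    wake_thread: Optional[int] = None) -> Dict[int, int]:
    """Return the time counter each thread resumes with after a barrier.

    arrival_times maps thread id -> simulated time at which the thread
    reached its P (or W) instruction.
    wake_thread is the id of the thread holding the W instruction
    (first model); None selects the second model (P instructions only).
    """
    if wake_thread is not None:
        # Model 1: every thread adopts the time counter of the W thread.
        resume_at = arrival_times[wake_thread]
    else:
        # Model 2: every thread resumes from the last arrival's counter.
        resume_at = max(arrival_times.values())
    return {thread: resume_at for thread in arrival_times}
```

In the second model the release time depends only on the simulated arrival order, which is why it remains usable when the simulated hardware differs from the machine used for CET generation.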
Similarly, serialization of threads' access to a critical section can also be handled in two ways:
1. One that allows replicating the original thread access order to the critical section. All threads will have P instructions (before the critical section) and a W instruction after the critical section, except for the first thread to arrive at the critical section (during CET generation), which will only have a W instruction after the critical section (as shown in Figure 5(d)). The simulator would pause each thread until its predecessor reaches the W instruction, and then that thread is granted access to the critical section and its time counter is set to the "W point" of the predecessor thread.
2. In the second method, all threads will have P instructions before the critical section and W instructions afterward. The simulator in this case would use a "P" FIFO to pipeline thread requests to access the critical section. That is, if a thread reaches a P, it is queued in the "P" FIFO. Threads are then granted access to the critical section on a first-come-first-served basis. When the current thread reaches the W instruction, it is dequeued from the "P" FIFO, and the next thread in line accesses the critical section, and so on. Hence, the time difference between queuing a thread and granting it access represents the actual simulated spinning time of that thread.
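The second method can be sketched as a small queueing model in Python. The sketch assumes that each thread requests the critical section once, that its arrival time at P and the time it spends between P and W are already known, and that ties are broken by input order; these simplifications and all names are illustrative only.

```python
from collections import deque
from typing import Dict, List, Tuple

# (thread id, time the thread reaches P, time spent between P and W)
Request = Tuple[int, int, int]


def simulate_critical_section(requests: List[Request]) -> Dict[int, int]:
    """Grant access first-come-first-served; return spin time per thread."""
    # The "P" FIFO: threads ordered by the time they reached their P.
    fifo = deque(sorted(requests, key=lambda request: request[1]))
    free_at = 0                    # time at which the section is next free
    spin_time: Dict[int, int] = {}
    while fifo:
        thread, arrival, hold = fifo.popleft()
        granted = max(arrival, free_at)
        # Spinning time = time between queuing and being granted access.
        spin_time[thread] = granted - arrival
        free_at = granted + hold   # the thread reaches its W instruction
    return spin_time
```

For instance, if thread 0 arrives at time 10 and holds the section for 5 time units while thread 1 arrives at time 12, thread 1 is granted access at time 15 and its simulated spinning time is 3.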
Atomic operations (read-modify-write) are broken into LOAD, modify (one or more instructions such as ALU instructions), and STORE instructions. This STORE instruction is given a special CET code that distinguishes it from a regular store instruction so it can be used to trigger a simulator's coherency protocol.
CET Code Generator
In this phase, the profiled image is refined and its instructions are replaced by the corresponding CET instruction formats. This includes processing the information collected for each instruction during the profiling phase. For example, in the case of a branch instruction, the loop detection algorithm is applied to check whether the instruction implements a loop. In the case of an unconditional jump, the generator checks whether the instruction always jumps to the same target and assigns the appropriate opcode accordingly. In the case of a load/store, it checks whether the instruction accesses a contiguous or noncontiguous block of data. The check for load/store can be performed earlier, at the profiling phase (e.g., after the instruction references 50 or any specified number of memory addresses). This saves time and space. Figure 6 shows the CET code generation algorithm.
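As an illustration of the jump check, the Python sketch below decides between embedding a single displacement in the instruction and emitting per-execution displacements as Jump-M data. Only the name Jump-M comes from the text; the opcode label "JUMP", the function name, and the indexing of targets by CET instruction position are assumptions.

```python
from typing import List, Optional, Tuple


def select_jump_encoding(
    source_index: int, target_indices: List[int]
) -> Tuple[str, Optional[int], List[int]]:
    """Choose an encoding for one unconditional jump/call/return.

    source_index: position of the jump in the CET code.
    target_indices: position of the target for each execution observed
    during profiling.
    Returns (opcode, embedded displacement or None, Jump-M data entries).
    """
    if not target_indices:
        raise ValueError("jump was never executed")
    # Displacements are counted in CET instructions, not bytes.
    displacements = [target - source_index for target in target_indices]
    if len(set(displacements)) == 1:
        # Always the same target: embed the displacement in the code.
        return "JUMP", displacements[0], []
    # Multiple targets: one displacement per execution goes to CET data.
    return "JUMP-M", None, displacements
```

A return instruction reached from two different call sites would fall into the second case, with one displacement stored per execution in FIFO order.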
To illustrate the proposed trace compression methodology, Figure 7 shows a generated CET trace for a small C-code snippet (a loop to find the maximum of a 1-million-integer array). For this simple example, the CET trace is approximately one-millionth (i.e., 0.000001) of the size of the uncompressed trace.
Emulator and CET Data Generator
This phase generates the CET data in files with proper sequential order (i.e., a FIFO). Hence, when the CET code is processed (e.g., by a timing simulator), data required by any CET instruction can be consumed from the corresponding CET data file sequentially, in the order in which it is needed. The other purpose of this step is to test the correctness of the CET code and report bugs, if any. It should be noted that this step practically eliminates any need to decompress or reconstruct the original trace. Figure 8 shows the emulation and CET data generation algorithm.
Efficient Generation of Compact Execution Traces 27:13
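The FIFO consumption described above can be illustrated with a toy interpreter in Python. It is a heavily simplified sketch: the tuple-based instruction format, the opcode name "LOAD-N" for a noncontiguous load, and the use of the innermost loop index for strided addresses are all assumptions, and only the names LOOP and LOAD-C are taken from the text.

```python
from collections import deque
from typing import List, Optional, Tuple


def replay(code: list, address_deltas: List[int],
           start_address: int = 0) -> List[Tuple[str, Optional[int]]]:
    """Expand toy CET code into (operation, address) events.

    code entries (hypothetical format):
      ("LOOP", count, body)       body is a nested list of entries
      ("LOAD-C", base, stride)    address computed from the loop index
      ("LOAD-N",)                 address taken from the delta FIFO
      (other_opcode,)             emitted with no address
    """
    fifo = deque(address_deltas)
    previous = [start_address]   # last irregular address seen
    trace: List[Tuple[str, Optional[int]]] = []

    def run(block: list, iteration: int) -> None:
        for instr in block:
            op = instr[0]
            if op == "LOOP":
                for i in range(instr[1]):
                    run(instr[2], i)
            elif op == "LOAD-C":
                trace.append(("LOAD", instr[1] + iteration * instr[2]))
            elif op == "LOAD-N":
                # Data is consumed strictly in the order it was generated.
                previous[0] += fifo.popleft()
                trace.append(("LOAD", previous[0]))
            else:
                trace.append((op, None))

    run(code, 0)
    if fifo:
        raise ValueError("CET data left unconsumed")
    return trace
```

Because every data item is read exactly once and in order, the interpreter never needs to materialize or decompress the full trace before simulating it.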
EXPERIMENTAL RESULTS

Experimental Setup
The proposed methodology was evaluated using a wide range of benchmarks that included a subset of Splash-2 (Ohara et al. 1995), PARSEC (2007), MediaBench I (1997), and SPEC CPU 2000 (Standard Performance Evaluation Corporation 2000). The generated CET code was compared to the best trace compression techniques (Chen et al. 2013; Milenkovic and Milenkovic 2007), as well as TCgen compressed traces. For the SPEC CPU 2000 benchmarks, only the first and 51st billion instructions were used to be able to compare with SBC, as was done in Milenkovic and Milenkovic (2007). Table 2 lists the benchmarks used along with their input sets. Experiments on all benchmarks except SPEC CPU 2000 were run on an Intel Xeon CPU E5-2680 machine (similar to the one used in Chen et al. (2013)). For the SPEC CPU 2000 benchmarks, experiments were run on a Pentium 4 machine similar to the one used in Milenkovic and Milenkovic (2007). The CET tool has been evaluated in two modes: (1) Instruction Addresses (IA)-only mode, in which the baseline trace entry consists of the instruction along with its address (i.e., 32-bit instruction address, 32-bit instruction), and (2) full mode, in which the whole trace (instructions, instruction addresses, and data addresses, if any) is compressed (i.e., 32-bit instruction address, 32-bit instruction [32-bit data address]). The IA-only mode was included for comparison with Chen et al. (2013). All 23 traces that were used to evaluate our CET tool were also compressed using TCgen.
Two metrics were used to evaluate the CET tool: (1) compression ratio, the size of the uncompressed execution trace over the size of the generated CET trace (CET code + data), and (2) generation/compression/decompression speed in MIPS (millions of an execution trace's instructions generated/compressed/decompressed per second). In IA mode, each trace element (executed instruction) in the uncompressed trace is 64 bits (32 bits for the instruction address and 32 bits for the instruction itself), whereas it is 96 bits in the full mode; an extra 32 bits are added to represent the data address, if any. Though generation and compression speeds are evaluated, they are less important than the compression ratio and decompression speed since an execution trace is generated/compressed only once but is used many times.
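The compression ratio defined above can be written out as a short Python sketch. The function names and all numbers in the example are made up for illustration and are not measurements from this work. The sketch also assumes that, in full mode, only instructions that reference memory carry the extra 32-bit data address; the text's "if any" leaves the exact baseline accounting open.

```python
def uncompressed_trace_bytes(num_instructions, num_data_refs=0, full_mode=False):
    """Size of the baseline (uncompressed) trace in bytes.

    Every entry holds a 32-bit instruction address and a 32-bit instruction
    (8 bytes). In full mode, 4 more bytes are counted per data address.
    Assumption: only memory-referencing instructions carry a data address.
    """
    size = num_instructions * 8
    if full_mode:
        size += num_data_refs * 4
    return size


def compression_ratio(uncompressed_bytes, cet_code_bytes, cet_data_bytes):
    """Uncompressed trace size over the CET trace size (code + data)."""
    return uncompressed_bytes / (cet_code_bytes + cet_data_bytes)


# Made-up example: one billion instructions in IA mode, 8 MB of CET output.
baseline = uncompressed_trace_bytes(1_000_000_000)
ratio = compression_ratio(baseline, cet_code_bytes=1_000_000, cet_data_bytes=7_000_000)
```

Note that the denominator includes both the CET code and the CET data files, so a benchmark with tiny CET code can still have a modest ratio if its data files are large.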
Accuracy Evaluation
The accuracy of the generated CET code has been validated by comparing the number of instructions executed natively (under PIN instrumentation) for several benchmarks against those emulated in our tool suite (i.e., the last step in our methodology). Table 3 summarizes the accuracy results. The differences are broken down into total number of instructions, load instructions, store instructions, and other instructions (that do not include threading/synchronization instructions). All thread and synchronization events (thread creation/termination, synchronization barriers, locks/unlocks, etc.) were identical between native execution and CET emulation. The table shows the accuracy results for one thread and 16 threads for all benchmarks. The relative differences were calculated as 100% × (native execution value - emulated CET value) / native execution value. These results show that there are no systematic differences (both positive and negative differences occur), and the largest difference in total number of instructions was ∼0.01% (in the 16-thread Radix benchmark). It should be noted that these differences are not necessarily errors. Upon careful examination of both original and CET instructions, we discovered that these differences are mainly due to two reasons. The first is the combination of the many ways x86 compilers encode loops and the way our loop detection algorithm works: the algorithm sometimes makes a +1 or -1 error in the inner loop's iteration counters. Hence, benchmarks with more inner loops incur more errors. This is also evident from the fact that, for the same benchmark, as the number of threads increases, the errors also increase since loops are basically duplicated among threads and the loop counters' values are divided by the number of threads. The second reason is related to the way x86 instructions are translated to CET code.
In order to make the CET code compatible with general load/store architectures, x86 memory-register instructions are translated to instruction pairs (Load-ALU or ALU-Store), which increases the number of CET instructions relative to the original instructions. Figure 9 shows the compression ratio for the two modes. In general, IA mode has a higher compression ratio than the full mode, because IA mode ignores data memory references. Thus, the generated CET trace in IA mode does not include data addresses, which are often the largest component of the compressed trace. However, full mode can achieve a higher compression ratio when the application has few noncontiguous load/store addresses, as in the ocean and blackscholes benchmarks. This is because the generated CET trace is nearly the same for the two modes, but the original execution trace is larger in the full mode. Figure 10 shows the compression ratio versus different problem sizes (small, medium, and large) of three different single-threaded benchmarks. From this figure, it is clear that for the swaptions and blackscholes benchmarks, the compression ratio is nearly constant for the three aforementioned problem sizes. However, it decreases when the problem size is increased for the bodytrack benchmark. Increasing the problem size increases the uncompressed trace size. However, the effect of increasing the problem size on the generated CET trace size depends on the application structure, that is, the distribution of the noncontiguous addresses or dynamic unconditional jumps across the application. Thus, if the compressed trace size increases at the same rate as the uncompressed one, the compression ratio is sustained. Otherwise, the compression ratio might increase or decrease as the problem size increases. The compression ratio achieved by the CET tool varies according to the application's structure, since it affects the content of the CET data.
For example, a large number of noncontiguous memory addresses, dynamic function calls, dynamic unconditional jumps, conditional branches inside loop bodies, and so forth result in larger CET data and therefore a lower compression ratio, and vice versa. Table 4 shows the original trace sizes, compressed trace sizes, and compression ratios achieved by the CET tool in comparison with the most notable compression methods in the literature: a compressor generated by TCgen, the technique of Chen et al. (2013), and SBC (Milenkovic and Milenkovic 2007). For the CET, results are shown for 23 single-threaded benchmarks for the two modes: full compression and instructions-only modes. Single-threaded benchmarks were used for comparisons with the other techniques' published results. Although the traces we used (x86 ISA) differ from those of Chen et al. and SBC, comparing the compressed trace sizes is still meaningful since these techniques do not retain the original trace addresses. As for the TCgen compressor, it was run on exactly the same traces that were used with the CET tool. Our CET tool outperforms Chen et al.'s technique, which achieved a compression ratio between 32.3 and 47.6. Chen et al.'s technique has the same baseline trace as our IA mode. This table shows that the CET tool in IA mode achieved a better compression ratio than Chen et al.'s technique by at least one order of magnitude. Moreover, in the full mode, the CET tool is still better by at least one order of magnitude for most of the benchmarks. The CET tool does not have any case worse than Chen et al.'s technique. The CET tool in IA mode outperforms Chen et al.'s technique because it handles the instruction addresses in a different manner. Their compressed trace contains a very long bit vector, one bit per instruction, to indicate whether the current instruction's address is sequential or not. It also includes the differences between the nonsequential instruction addresses.
On the other hand, the CET compressed trace captures the program flow control, and hence, when the CET code is executed, the instruction addresses are regenerated on the fly. CET compression also outperformed the TCgen compressor by orders of magnitude. The SBC compression ratios in Table 4 were obtained from Milenkovic and Milenkovic (2007) for two trace segments: first and 51st billion instruction segments. The authors also mentioned that the average segment size was 11GB; hence, the uncompressed traces averaged 22GB in total size, which is very close to our trace sizes (∼22.4GB) for the same trace segments. The compressed trace sizes were estimated by dividing the average trace size by the reported compression ratio. While this is not accurate, it serves to estimate the order of the compressed trace size to have a meaningful comparison with CET.
Compression Ratio
Unlike the CET tool, SBC was not run on a fully detailed input trace but only a 38-bit Dinero (Edler and Hill 1998) memory trace. These traces were divided into two files each, one for instruction addresses and another for data addresses, and the 2-bit header (which identified whether the address was an instruction, data load, or data store) was stripped out. This of course reduced the trace sizes by over 470MB of otherwise noncompressible data. It was not clear, however, how that would affect reconstruction of the original trace when decompressed. Most of the savings in the SBC method are from the use of variable-length stride and repetition count fields, and from separating the instruction and data addresses into two files and removing the 2-bit header from the original trace. The variable-length encoding complicates the decompression process, especially if it is done in an HW simulator, where alignment of the compressed trace entries in memory becomes an issue. Furthermore, the removal of the noncompressible 2-bit header and the separation of instruction and data addresses into two files, although saving hundreds of megabytes, mean that the original trace may not be reconstructed with the original order of memory references (instructions and data), which in turn limits the technique's usefulness in architectural simulations. As Table 4 shows, for all considered benchmarks, CET's compressed files have sizes similar to those compressed by SBC. CET traces, however, not only have fixed-length fields but also retain most of the original trace details (not just memory references) and do not require any further decompression. Figure 11 shows the full mode compression ratio of nine benchmarks for different numbers of threads, namely, one, two, four, eight, and 16 threads. In this experiment, the total uncompressed and compressed trace sizes are the sums of the uncompressed and compressed trace sizes of all threads, respectively.
In most cases, the compression ratio remains nearly constant as the number of threads increases. Because the application is distributed across the available threads, the total uncompressed and compressed trace sizes do not change markedly. However, the compression ratio decreases for the ocean benchmark. This variation is due to the variation of the CET data size, especially the number of noncontiguous addresses, when the number of threads changes. Table 5 shows a breakdown of the sizes of the CET code and different CET data components for several benchmarks in comparison with the original executable and data file sizes. It also lists the number of loops per benchmark and the percentage of these loops with constant or contiguous data addresses (i.e., addresses that do not need to be stored in the CET data files). In addition to compressing instruction addresses, loops with such data references result in more compression since no data references need to be saved. The different CET data types were explained in Section 3.1. It should also be noted that these benchmarks generate huge datasets from initial small data inputs. As this table shows, the CET code sizes are very small and most of the compressed trace size is in its data. These data are of course very important to recreate the original execution events, and their size varies depending on the nature of the benchmark. CET addresses, conditional branch results, and jump displacements are the most prominent components in the CET data.
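To make the notion of "constant or contiguous data addresses" concrete, the following Python sketch classifies the sequence of addresses one memory instruction touches across loop iterations. It is an illustration only and not the loop detection algorithm of this work: the function names are hypothetical, and "contiguous" is assumed here to mean a fixed nonzero stride between consecutive iterations.

```python
def classify_addresses(addresses):
    """Classify the addresses one memory instruction touches over a loop.

    Returns "constant" if every iteration uses the same address,
    "contiguous" if consecutive addresses differ by one fixed stride
    (assumed meaning of the term), and "noncontiguous" otherwise.
    """
    if len(addresses) < 2:
        return "constant"
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    if strides == {0}:
        return "constant"
    if len(strides) == 1:
        return "contiguous"
    return "noncontiguous"


def regenerable_loop_percentage(loops):
    """Percentage of loops whose addresses need not be stored explicitly."""
    regenerable = sum(
        1 for addrs in loops if classify_addresses(addrs) != "noncontiguous"
    )
    return 100.0 * regenerable / len(loops)


# Made-up address streams for four loops.
loops = [
    [0x100, 0x100, 0x100, 0x100],   # same address every iteration
    [0x100, 0x104, 0x108, 0x10C],   # fixed stride of 4 bytes
    [0x100, 0x180, 0x104, 0x1F0],   # irregular
    [0x200, 0x208, 0x20C],          # stride changes
]
percentage = regenerable_loop_percentage(loops)
```

For the first two streams, a start address and a stride are enough to regenerate every reference, which is why such loops add nothing to the address data file; the last two would have to be stored address by address.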
Sources of Compression
The results in Table 5 give insight into the workings of the proposed trace generation/compression methodology. Applications with regular loops that access the same data or regularly referenced data would achieve the highest compression. This is evident from the small sizes of the CET data files, except the branch result file. For example, for the Ocean benchmark, the branch results data file is relatively large, indicating a high number of conditional branches. Yet, the other three data files are small, indicating regular instruction/data references and hence the large compression ratio. On the other hand, benchmarks with large CET addresses and/or irregular branch target address (i.e., Jump-M) files achieved the lowest compression even if they contained many loops with regular data references (e.g., the Water-sp, Water-nsq, and Cjpeg benchmarks). Figure 12 and Figure 13 show the generation/compression and decompression times for the CET and SBC techniques. The CET tool was run as a single thread on a machine similar to the one reported in Milenkovic and Milenkovic (2007). It should be noted that for the CET, the decompression time is actually also part of the trace generation time (to produce ready-to-use CET code). Though SBC has a near-constant compression time, the CET tool is faster than SBC by orders of magnitude. This is due to the variable field-length encoding/decoding of the SBC technique, which requires manipulating large linked lists. The CET's decompression stage simply executes the CET code, and once a CET datum is required, it will be ready at the front of the corresponding FIFO; that is, no complex decoding steps are needed. Though all benchmarks had the same number of instructions, the CET generation time varies depending on the nature of the programs since it uses dynamic data structures (linked lists) during CET code and data generation. Table 6 shows the CET trace generation speed in MIPS.
It should be noted that, as with all trace compression techniques, the trace generation time is several orders of magnitude longer than the actual execution time of the original applications/benchmarks. The average trace generation speed is 186.4 MIPS, while the maximum is 789.1 MIPS (for the Bodytrack benchmark). These results also show that decompression is much faster than generation. This is expected since decompression is simply decoding and executing the CET code. The generation speed depends on the benchmark's structure; for example, the longer the loop chains and address lists are, the slower the generation. Figure 14 shows the compression time (normalized to that of a single thread) for five benchmarks as a function of the number of threads. As can be seen, compression time does not increase significantly with the number of threads. In fact, for some benchmarks, it decreases with the number of threads due to the smaller data structures that have to be handled. As mentioned earlier, the compression time depends on the total size of the application and, more importantly, the nature of the application.
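The two quantities used in this discussion, speed in MIPS and compression time normalized to the single-thread case, reduce to simple arithmetic. The Python sketch below uses hypothetical function names and made-up timings; none of the numbers are taken from Table 6 or Figure 14.

```python
def speed_mips(num_instructions, seconds):
    """Millions of trace instructions processed per second."""
    return num_instructions / (seconds * 1e6)


def normalize_to_single_thread(times_by_threads):
    """Divide each compression time by the one-thread time.

    `times_by_threads` maps a thread count to a time in seconds and
    must contain an entry for one thread.
    """
    base = times_by_threads[1]
    return {n: t / base for n, t in sorted(times_by_threads.items())}


# Made-up example: two billion instructions generated in ten seconds.
generation_speed = speed_mips(2_000_000_000, 10.0)

# Made-up timings for 1, 2, and 4 threads.
normalized = normalize_to_single_thread({1: 4.0, 2: 5.0, 4: 3.0})
```

A normalized value below 1 for some thread count corresponds to the case noted above, where compression time decreases because each thread's data structures are smaller.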
Compression/Decompression Speed
CONCLUSIONS AND FUTURE WORK
In this work, a new method for generating compact traces of multithreaded applications is proposed. Such traces can be used in trace-based simulation of multicore architectures. The proposed on-the-fly generation method has several novelties over existing trace generation and compression techniques, namely, retention of most execution event records, capturing of multithreading-related synchronization events, no need for decompression (the generated trace can be directly interpreted/executed by a simulator), and compatibility with both SW- and HW-based simulators. A complete tool suite has been created to implement and evaluate the proposed method. On several benchmarks, experimental results showed that, compared to other methods, the proposed method can achieve a high compression ratio in a reasonable time while retaining good fidelity. Moreover, it scaled well when the problem size and the number of threads were increased. Scalability can be further enhanced in the future by implementing interthread compression (extracting and compressing common areas of threads). This, however, has proven to be a challenge under the constraint of the no-decompression requirement.
