Abstract. Instruction and data address traces are widely used by computer designers for quantitative evaluations of new architectures and workload characterization, as well as by software developers for program optimization, performance tuning, and debugging. Such traces are typically very large and need to be compressed to reduce the storage, processing, and communication bandwidth requirements. However, preexisting general-purpose and trace-specific compression algorithms are designed for software implementation and are not suitable for runtime compression.
Introduction
Instruction and data address traces are invaluable for quantitative evaluations of new architectures as well as for workload characterization, performance tuning, testing, and debugging. Two major issues are trace collection and storage. To offer a faithful representation of the system workload or to capture program behavior in real-world conditions, traces are needed from programs that run for seconds or even minutes on real machines. Hence, trace files tend to be very large and difficult to use and distribute. To reduce their size, they are typically compressed using general-purpose compression algorithms such as Ziv-Lempel (gzip) [1], the Burrows-Wheeler transform (bzip2) [2], or Sequitur [3]. Although these algorithms offer good compression ratios, more efficient compression is possible when the specific nature of redundancy in traces is taken into account.
Trace-specific compression techniques can be broadly classified in two groups, depending on whether they compress only instruction traces or traces including both instruction and data address information. Instruction traces can be compressed either by replacing an execution sequence by its identifier [4] [5] [6] [7] or by exploiting control-flow graph information [8, 9]. Combined instruction and data address traces can be compressed by recording only offsets from previous trace records of the same type [4, 10], by linking data addresses to the corresponding dynamic basic blocks or loops [11] [12] [13] [14], or by regenerating values using abstract execution [9, 15] or prediction [16, 17]. Compression of more complex trace records can exploit trace locality by storing relevant values in a cache-like structure so that a compressed trace consists of cache hit and miss information [18].
Virtually all trace compression techniques target compression in software. However, some computer systems could greatly benefit from hardware support for trace collection and compression, such as emerging systems-on-a-chip with multiple embedded RISC and DSP processor cores. Such systems present a formidable challenge to efficient debugging and performance tuning. For instance, ARM offers a module for tracing the complete pipeline information [19]. However, the existing compression techniques that can be efficiently implemented in hardware have poor compression ratios. For example, the ARM emulator compresses traces by replacing sequences of the same record by their repetition count [20].
In this paper, we present a set of trace compression algorithms targeting on-the-fly compression of instruction and data address traces. The proposed algorithms strive to provide a good compression ratio while minimizing the required chip area for the trace compressor and the number of pins on the trace port. For the compression of instruction address traces we propose two new structures: stream caches and N-tuple history buffers. For the compression of data address traces we propose novel data address stride caches. Detailed experimental analyses based on full system simulations (i) prove the feasibility of runtime compression, (ii) show the proposed instruction address trace compressor to outperform gzip with minimal hardware cost, and (iii) demonstrate that the proposed data address trace compressor performs as well as gzip with relatively small structures. The compression ratio over all considered instruction address traces is 87.4 with gzip and 125.9 with a 128-entry stream cache and a 255-entry trace history buffer. The compression ratio over all considered data address traces is 6.78 with gzip and 6.16 with a 1024-entry data address stride cache. The total size of the stream compressor corresponds to 7629 bytes of on-chip memory.
The rest of this paper is organized as follows. Section 2 describes the architecture of the trace compressor and presents algorithms for instruction and data address trace compression. Section 3 discusses the results of the experimental analysis. Section 4 concludes the paper.
Instruction and Data Address Trace Compression
The proposed algorithms for instruction and data address trace compression are suitable for both software and hardware implementations. A software implementation may be used as an operating system plug-in for on-line compression or as a separate application for compressing already generated trace files. In this paper, we focus on hardware implementations. Our goals are (i) to minimize the size of the structures to reduce the chip area required for trace compression, (ii) to provide real-time compression so that the processor is never stalled, and (iii) to achieve a good compression ratio so that the trace port requires only a few external pins. Figure 1 shows the structure of the proposed trace compressor. The trace compressor receives instruction addresses (the program counter, PC), data addresses (DA), and task switch information from the processor core. The first level of the trace compressor encompasses an instruction stream cache (SC) and a data address stride cache (DASC). The output from this level consists of four components: the stream cache index trace (SCIT), the stream cache miss trace (SCMT), the data address trace (DT), and the data address miss trace (DMT). Redundancy in the output traces can be further exploited with an optional second-level compressor that features N-tuple compression for the SCIT trace component and data repetitions, a simple finite state machine that compresses repetitions in the DT stream. The final streams are forwarded to a trace output controller that manages the output of the logical trace streams (synchronize, pack, add header) and interfaces with the external trace unit through the trace port pins, akin to ARM trace funneling [19]. Internal buffers ensure that the trace compression proceeds without stalling the processor and without dropping data.
Instruction Address Trace Compression
Instruction trace compression exploits temporal and spatial locality in instruction streams [14] . An instruction stream is defined as a sequential run of instructions, from the target of a taken branch to the first taken branch in the sequence. Previous studies show that most programs generate only a small number of unique instruction streams. For example, the average instruction stream length is about 12 instructions for the SPEC CPU2000 integer applications and about 117 instructions for the floating-point applications, with a maximal length of 3162 instructions and a minimal length of one instruction [14] . The starting address (SA) and length (SL) uniquely identify an instruction stream.
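The stream definition above can be illustrated with a short software sketch. It is not the detection hardware of the paper: it assumes fixed 4-byte instructions (`INSTR_BYTES` is a hypothetical constant) and treats any non-sequential program counter value as the start of a new stream, so a taken branch to the next sequential address would go unnoticed. It also does not split streams that exceed a fixed length field.

```python
from typing import Iterable, Iterator, Tuple

INSTR_BYTES = 4  # assumption: fixed-width 4-byte instructions


def detect_streams(pcs: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Yield (SA, SL) for each instruction stream in a PC sequence."""
    sa = None    # starting address of the current stream
    sl = 0       # number of instructions in the current stream
    prev = None  # previous program counter value
    for pc in pcs:
        if prev is not None and pc == prev + INSTR_BYTES:
            sl += 1                # sequential fetch: the stream continues
        else:
            if sa is not None:
                yield sa, sl       # non-sequential PC closes the stream
            sa, sl = pc, 1         # and opens a new one at this address
        prev = pc
    if sa is not None:
        yield sa, sl               # flush the last (possibly partial) stream


streams = list(detect_streams([0x100, 0x104, 0x108, 0x200, 0x204, 0x100]))
```

Here `streams` holds three descriptors: a 3-instruction stream at 0x100, a 2-instruction stream at 0x200, and a 1-instruction stream at 0x100.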
To compress an instruction address trace, we detect instruction streams and replace each of them with an identifier, which is similar to the SBC trace compression technique [14, 21]. Instruction streams are detected as described in Figure 4 using very simple hardware (Figure 2). SA and SL are placed in the instruction stream buffer, which is a FIFO structure that buffers possible bursts of short instruction streams. S.SA and S.SL are read from the instruction stream buffer and a stream cache lookup is performed (Figure 4). In case of a stream cache hit, the corresponding stream cache index (concatenated iSet and iWay indices) is emitted to the SCIT. In case of a cache miss, the reserved index 0 is emitted to the SCIT, and the stream descriptor (S.SA and S.SL) is emitted to the SCMT. The algorithm then deterministically selects a cache entry to be replaced, and the selected entry is updated with the stream descriptor. The compression ratio achieved by the stream cache compression, CR(SC.I), is defined as the ratio of the raw instruction address trace (Itrace) size, calculated as the number of instructions multiplied by the address size, and the sum of the sizes of the output traces SCIT and SCMT (Eq. 1). It can be expressed analytically as a function of the average dynamic stream length (SL.Dyn), the stream cache hit rate (SC.Hit), and the stream cache size (N_SET*N_WAY) (Eq. 2). For each instruction stream, log2(N_SET*N_WAY) bits are emitted to the SCIT output. On each miss in the stream cache, 5 bytes are emitted to the SCMT output, assuming 1-byte stream lengths and 4-byte addresses.
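The lookup, emit, and replace steps can be sketched in software as follows. Several details are assumptions made only for illustration: the set index function `set_index` is a placeholder for F(S.SA, S.SL), the replacement policy is round-robin per set (the text only requires it to be deterministic), and way 0 of set 0 is left unused so that no valid entry collides with the reserved index 0. The function `cr_stream_cache` is one way to write the analytical ratio implied by the stated output sizes: per stream, 4*SL.Dyn raw bytes against log2(N_SET*N_WAY)/8 index bytes plus 5 bytes per miss.

```python
import math


class StreamCache:
    """Set-associative stream cache; index 0 is reserved to signal a miss."""

    def __init__(self, n_sets=32, n_ways=4):
        assert n_ways >= 2, "sketch keeps (set 0, way 0) unused, so it needs 2+ ways"
        self.n_sets, self.n_ways = n_sets, n_ways
        self.entries = [[None] * n_ways for _ in range(n_sets)]
        # Assumed replacement policy: round-robin per set. Set 0 skips way 0
        # so that no valid entry ever maps to the reserved index 0.
        self.victim = [1] + [0] * (n_sets - 1)

    def set_index(self, sa, sl):
        # Placeholder for F(S.SA, S.SL); not the mapping used in the paper.
        return ((sa >> 2) ^ sl) % self.n_sets

    def compress(self, sa, sl):
        """Return (SCIT index, SCMT record or None) for one stream."""
        i_set = self.set_index(sa, sl)
        ways = self.entries[i_set]
        for i_way, entry in enumerate(ways):
            if entry == (sa, sl):                      # hit: emit <iSet, iWay>
                return i_set * self.n_ways + i_way, None
        lo = 1 if i_set == 0 else 0                    # first usable way in this set
        i_way = self.victim[i_set]
        ways[i_way] = (sa, sl)                         # update the selected entry
        self.victim[i_set] = lo + (i_way + 1 - lo) % (self.n_ways - lo)
        return 0, (sa, sl)                             # miss: <0> plus descriptor


def cr_stream_cache(sl_dyn, hit_rate, n_sets, n_ways, addr_bytes=4, miss_bytes=5):
    """Analytical compression ratio per stream, from the stated output sizes."""
    index_bytes = math.log2(n_sets * n_ways) / 8
    return addr_bytes * sl_dyn / (index_bytes + (1.0 - hit_rate) * miss_bytes)


sc = StreamCache(32, 4)
first = sc.compress(0x1000, 5)    # cold miss
second = sc.compress(0x1000, 5)   # hit on the entry just installed
```

Because the replacement choice depends only on the sequence of misses, a decompressor that mirrors the same structure can rebuild every stream descriptor from the SCIT and SCMT outputs.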
Typically, we see high stream cache hit rates due to the small number of unique instruction streams and the high temporal locality of the streams. Consequently, the size of the compressed trace is predominantly determined by the size of the SCIT output. The SCIT output trace is highly redundant because the majority of the runtime is spent in critical portions of the code that often encompass short sequences of instruction streams. To further exploit this redundancy with small hardware resources, we employ N-tuple compression.
Stream cache compression algorithm (Figure 4):
    Get the next stream from the instruction stream buffer (S.SA, S.SL);
    Perform lookup in the stream cache with iSet = F(S.SA, S.SL);
    if (hit)
        Emit <iSet, iWay> to SCIT;
    else {
        Emit reserved value <0> to SCIT;
        Emit stream descriptor <S.SA, S.SL> to SCMT;
        Select an entry (iWay) in the iSet set to be replaced;
        Update the selected stream cache entry;
    }
N-tuple compression algorithm:
    Get the next index from the SCIT stream;
    if (N-tuple incoming stream buffer is full) {
        Perform lookup in the Tuple History Buffer (THB);
        if (hit) {
            Emit <index in the THB> to the Tuple.Hit trace;
            // emit the first index found in the buffer
        } else {
            Emit <0> to Tuple.Hit trace;
            Emit <N-tuple> to Tuple.Miss trace;
        }
        Update the Tuple History Buffer;
    }
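The N-tuple steps can be illustrated with a minimal software sketch. The organization of the Tuple History Buffer is not detailed in the text, so the sketch assumes that it is a FIFO of whole tuples, that every completed tuple is appended after the lookup, and that a hit is reported as the 1-based position of the first match (0 stays reserved for a miss). The class and attribute names are hypothetical.

```python
from collections import deque


class NTupleCompressor:
    """Groups SCIT indices into N-tuples and looks them up in a history buffer."""

    def __init__(self, n=8, thb_entries=255):
        self.n = n
        self.incoming = []                     # N-tuple incoming stream buffer
        self.thb = deque(maxlen=thb_entries)   # assumed FIFO of recent tuples
        self.hit_trace = []                    # Tuple.Hit output
        self.miss_trace = []                   # Tuple.Miss output

    def push(self, scit_index):
        self.incoming.append(scit_index)
        if len(self.incoming) < self.n:
            return                             # tuple not complete yet
        tup = tuple(self.incoming)
        self.incoming.clear()
        # Look up the tuple; keep the first matching position, if any.
        pos = next((i for i, t in enumerate(self.thb) if t == tup), None)
        if pos is not None:
            self.hit_trace.append(pos + 1)     # 1-based; 0 is reserved
        else:
            self.hit_trace.append(0)
            self.miss_trace.append(tup)
        self.thb.append(tup)                   # update the history buffer


ntc = NTupleCompressor(n=2, thb_entries=3)
for index in [1, 2, 1, 2, 3, 4, 1, 2]:
    ntc.push(index)
```

With 2-tuples, the repeated pair (1, 2) is emitted in full only once; later occurrences cost a single buffer index.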
Data Address Trace Compression
Unlike instruction addresses, data addresses (of memory referencing instructions) rarely stay constant during program execution [22]. However, they often have a regular stride. Our proposed algorithm for runtime data address trace compression exploits temporal locality of memory referencing instructions and regularity in data address strides. The data address trace compression utilizes a data address stride cache (DASC). The DASC is a tagless direct-mapped cache-like structure, where each entry consists of two fields: a last data address (LDA) and a stride field (Figure 5). The compression ratio achieved by the data address trace compression, CR(DASC.D), is defined as the ratio of the raw data address trace (Dtrace) size, calculated as the number of memory referencing instructions multiplied by the address size, and the sum of the sizes of the output traces DT and DMT (Eq. 3). It can be expressed analytically as a function of the data address stride cache hit rate (Eq. 4). For each memory referencing instruction a single bit is emitted to the DT. On each miss a 4-byte address is emitted to the DMT. A generalized set-associative organization of DASC promises even better stride hit rates and consequently better compression ratios. However, the set-associative DASC requires address tags to be kept, which increases hardware complexity. Hence, we do not consider such DASCs in this paper. A simple state machine detects repetitions in the DT output and replaces repeating patterns with a <pattern, number of repetitions> pair.
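A minimal sketch of a tagless direct-mapped DASC follows. The text does not spell out the index function, the hit condition, or the update rule, so the sketch assumes that the entry is selected by the address of the memory referencing instruction, that a hit means the data address equals LDA + stride, and that the stride is recomputed on a miss while the LDA is updated on every access. Full-width 32-bit stride fields are also an assumption. The function `cr_dasc` writes out the ratio implied by the stated output sizes: 4 raw bytes per reference against one bit plus 4 bytes per miss.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit addresses and full-width strides


class DataAddressStrideCache:
    """Tagless direct-mapped cache of (last data address, stride) pairs."""

    def __init__(self, n_entries=1024):
        self.n = n_entries
        self.lda = [0] * n_entries
        self.stride = [0] * n_entries
        self.dt = []    # one hit/miss bit per memory reference
        self.dmt = []   # full data addresses, emitted on misses only

    def access(self, pc, da):
        i = (pc >> 2) % self.n                 # assumed index: instruction address
        if da == (self.lda[i] + self.stride[i]) & MASK32:
            self.dt.append(1)                  # stride hit: address is implied
        else:
            self.dt.append(0)                  # miss: send the address itself
            self.dmt.append(da)
            self.stride[i] = (da - self.lda[i]) & MASK32
        self.lda[i] = da


def cr_dasc(hit_rate, addr_bytes=4):
    """Analytical compression ratio per reference, from the stated output sizes."""
    return addr_bytes / (1 / 8 + (1.0 - hit_rate) * addr_bytes)


dasc = DataAddressStrideCache(1024)
for address in [0x1000, 0x1004, 0x1008, 0x100C]:
    dasc.access(0x400, address)        # one load walking an array with stride 4
```

In this example the first two references miss while the entry learns the stride, and the remaining references each cost a single bit.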
Experimental Evaluation and Results
The goals of the experimental evaluation are (i) to assess the effectiveness of the proposed compression algorithms and (ii) to explore the feasibility of the proposed hardware implementations. We compare the compression ratio of the proposed algorithms to the compression ratio achieved by the general-purpose compression algorithms in the gzip (fast, default, best) and bzip2 (best) software utility programs. To explore the design space of the hardware trace compressor, we extended the SimpleScalar simulator [23] to support the proposed runtime trace compression algorithms. As the workload, we use complete runs of 16 MiBench programs. Table 1 shows the benchmark characteristics, including the number of instructions executed (IC), the number of unique streams (NUS), the maximum stream length (max.SL), and the average dynamic stream length (SL.Dyn). This table reveals that the number of unique instruction streams is relatively small. The average stream length ranges from 5.61 in stringsearch to 54.6 in adpcm_c.
Instruction Address Trace Compression
The compression ratio for instruction address traces depends on application characteristics (such as the average stream length and the temporal locality of the instruction streams) and the stream cache parameters. To evaluate the impact of the stream cache size and organization, we vary the number of entries from 8 to 256, and the number of ways from 1 to 8. Table 2 shows the average stream cache hit rate and the total compression ratio (the sum of the raw instruction traces for all applications divided by the sum of all compressed traces). The results indicate that even very small stream caches can achieve a good compression ratio. For example, the 16x4 (16-set and 4-way) stream cache achieves an overall compression ratio of 44.1, i.e., about 80% of the compression ratio achieved with the 32x4 stream cache, which is twice as complex. Increasing the associativity of the stream cache improves the compression ratio. Even though the 16x8 stream cache yields the best overall compression ratio of 57.4, the 32x4 represents the best price-performance tradeoff. We have tested several mapping functions, and S.SA<5+ne:6> xor S.SL<ne-1:0> performs the best, where ne = log2(N_SET*N_WAY). The chosen stream cache organization achieves a better compression ratio than gzip with the "fast" option on the raw instruction traces (Table 3).
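The bit-field notation of the mapping function can be made concrete with a short sketch. It assumes that `<hi:lo>` denotes an inclusive bit range and that the number of entries is a power of two; the helper names are hypothetical. The sketch returns the ne-bit hashed value exactly as the expression is written and does not model how that value selects a set.

```python
def bits(value, hi, lo):
    """Return the bit field value<hi:lo>, with both bounds inclusive."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def stream_cache_hash(sa, sl, n_sets, n_ways):
    """Evaluate S.SA<5+ne:6> xor S.SL<ne-1:0> with ne = log2(n_sets * n_ways)."""
    ne = (n_sets * n_ways).bit_length() - 1    # exact log2 for powers of two
    return bits(sa, 5 + ne, 6) ^ bits(sl, ne - 1, 0)


h = stream_cache_hash(0x12345678, 5, 32, 4)    # ne = 7: SA bits 12..6 xor SL bits 6..0
```

Both fields are ne bits wide, so the stream length perturbs every bit of the address-derived value.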
N-tuple compression can further compress the SCIT trace. We consider a 32x4 stream cache and a 255-entry 8-tuple history buffer. Table 3 shows the compression ratio for the following algorithms: stream cache compression only (SC.I), combined stream cache and N-tuple compression (SC.I+Ntup), gzip (default, fast, best), and bzip2 (best). The combined SC.I+Ntup outperforms gzip even with the "best" option, yet it can be performed in real time with small on-chip hardware structures. It only requires a bandwidth of 0.25 bits per executed instruction on the trace port.
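As a quick consistency check, the trace port bandwidth follows from the compression ratio: each raw trace record is one 4-byte (32-bit) instruction address, so the compressed output costs 32/CR bits per executed instruction. The sketch below applies this to the overall ratio of 125.9 quoted in the introduction for the stream cache combined with the tuple history buffer; the function name is hypothetical.

```python
def trace_port_bits_per_instruction(compression_ratio, addr_bits=32):
    """Average compressed output, in bits, per executed instruction."""
    return addr_bits / compression_ratio


bandwidth = trace_port_bits_per_instruction(125.9)
```

The result rounds to 0.25 bits per instruction, in line with the figure stated above.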
Data Address Trace Compression
The compression ratio for data address traces depends on program behavior (the number of memory referencing instructions and their locality) and the size and organization of the DASC structure. We vary the size of the DASC from 128 to 1024 entries. Table 4 shows the compression ratios for data address trace compression for different DASC structures as well as the compression ratio achieved by gzip (fast, default, best) and bzip2 (best) on the raw data address traces. The results indicate that increasing the number of entries is beneficial. The 1024-entry DASC achieves a compression ratio of 6.12, which is higher than that of fast gzip, but slightly lower than that of default and best gzip. The tagged DASC with the same number of entries, organized as a set-associative structure with 256 sets and 4 ways, achieves a compression ratio of 6.6, which is as good as default gzip. This translates into a bandwidth of 0.26 bits per executed instruction on the trace port. A 256-entry DASC requires 0.4 bits/instruction. 
Hardware Complexity
So far we have shown that the proposed algorithms achieve a good compression ratio, ensuring that a small trace port would suffice. In addition, the compressed output traces are suitable for further compression in software, which allows the design of external trace units that can capture traces over prolonged periods of time for experimental systems (the results are not shown due to page limitations). The simple hardware structures guarantee low latency of the proposed compression. To verify that we can perform runtime compression without stalling the processor, we extended the SimpleScalar full system simulator to support our runtime compressor. In addition to verifying the feasibility of the proposed system, this simulator is used to determine the minimal necessary depth of the instruction stream buffer (Figure 2) and the data address buffer (Figure 1). We assume that the stream cache latency is 1 clock cycle for hits and 2 clock cycles for misses. The DASC latency is 2 clock cycles for both hits and misses. The modeled processor corresponds to the XScale processor. The results indicate that the instruction stream buffer needs only 2 entries, while the data address buffer needs 8 entries. Table 5 provides an estimate of the hardware complexity of the proposed structures. The overall size corresponds to 7629 bytes, which is several times smaller than L1 processor caches, giving further evidence that the structures can operate at CPU clock frequencies.
Conclusion
This paper presents a set of algorithms for runtime compression of instruction and data address traces. Based on these algorithms we propose an on-chip hardware compressor capable of unobtrusive real-time instruction and data address trace compression. It achieves excellent compression ratios, comparable to general-purpose compression in software, at minimal hardware complexity. 
