Debugging parallel programs is a well-known and difficult problem. A promising way to facilitate it is to use hardware support to achieve deterministic replay on a Chip Multi-Processor (CMP). As a Design-For-Debug (DFD) feature, a practical hardware-assisted deterministic replay scheme should have low design and verification costs, as well as a small log size.
INTRODUCTION

Motivation
Parallel programming is probably the best way of exploiting the computational ability of a Chip Multi-Processor (CMP). However, debugging parallel programs is notoriously difficult. For example, the nondeterminism of parallel programs may introduce "Heisenbugs" that exhibit erroneous behavior under certain conditions but disappear when one probes or attempts to isolate them. A promising method to tackle this nondeterminism is deterministic replay. A deterministic replay scheme continually records the nondeterministic factors in the original execution (production run) of a parallel program as logs, and uses the logs to force another execution (replay run) to exhibit the same behavior as the production run. During the last two decades, deterministic replay has been employed in different applications (e.g., debugging parallel programs [Bacon and Goldstein 1991], fault tolerance [Hower and Hill 2008], performance prediction, and intrusion detection [Dunlap et al. 2002]). Owing to these broad applications, deterministic replay has been intensively investigated by the computer architecture and parallel computing communities.
In general, previous investigations on deterministic replay mainly focus on dealing with the uncertainty of memory races, that is, the execution order between a pair of successive conflicting memory instructions accessing the same memory location. In production run, they dynamically identify and record the logical time orders between instructions (or instruction blocks), which is the transitive closure of program and execution orders; in replay run, they use the execution orders inferred from the recorded logical time orders to constrain the execution of the parallel program. According to the concrete methodology of implementing deterministic replay, there are mainly two categories of schemes: software-only schemes and hardware-assisted schemes.
Software-only schemes rely purely on the support of specific system software environments (e.g., operating system, virtual machine, and library) to deal with execution orders [Altekar and Stoica 2009; Leblanc and Mellor-Crummey 1987]. However, owing to the restrictions imposed on system software, these schemes often pay a remarkable performance loss in production run. Different from software-only schemes, hardware-assisted schemes pursue dedicated hardware support to efficiently record and replay execution orders [Bacon and Goldstein 1991; Devietti et al. 2009; Hower and Hill 2008; Montesinos et al. 2008, 2009; Narayanasamy et al. 2005, 2006; Voskuilen et al. 2010; Xu et al. 2003, 2006]. Benefiting from the dedicated hardware support, hardware-assisted schemes often introduce a smaller performance loss in production run than software-only schemes.
Owing to the attractive features of hardware-assisted deterministic replay, it is quite possible that a future commercial CMP may integrate some hardware-assisted deterministic replay scheme as a Design-For-Debug (DFD) functionality. Nevertheless, the industrial guidelines of DFD, obeyed by most practical DFD techniques [Abramovici et al. 2006, 2008; Carbine and Feltham 1998; CoreSight 2011; Foster et al. 2007; Hsieh and Huang 2008; Huott et al. 1999; Josephson 2006; Kao et al. 2007; Livengood and Medeiros 1999; Pyron et al. 2002; Silas et al. 2003; Zilmer 1999], may have strict requirements on the following aspects.
-Most importantly, the DFD functionality should not affect the performance of the normal functionality in production run. To the best of our knowledge, few commercial chips pay a remarkable performance cost to facilitate debugging either hardware or software.
-The DFD functionality should be decoupled from the normal functionality of the chip, otherwise the intensive interactions between normal and DFD functionalities may increase the risk of bugs. For example, the well-known EJTAG standard proposed by MIPS [Zilmer 1999], which is a DFD standard to facilitate debugging low-level software on MIPS processors, has its own datapath independent of the normal memory hierarchy. Moreover, many DFD techniques have their own (yet simple) network-on-chip to avoid disturbing the normal network-on-chip [Abramovici et al. 2006, 2008; Hu et al. 2009].
-The log size of DFD functionality should be small enough to fit the limited debugging I/O bandwidth. Although one can save the debugging log into the memory hierarchy for normal functionality, doing so may affect the performance of normal functionality. Furthermore, when a CMP crashes or deadlocks, there is a risk of losing all debugging information saved in the normal memory hierarchy. For these reasons, many processors employ JTAG [IEEE Std. 1149.1-1990] as the dedicated I/O to transfer debugging information, whose bandwidth is less than 5MB per second [CoreSight 2011; Zilmer 1999].
-The DFD functionality should not rely on an infeasible feature that does not suit commercial chips. For example, a practical deterministic replay scheme should not assume sequential consistency, since few commercial CMPs adopt sequential consistency.
-The hardware cost of DFD should be moderate. It is hard to imagine that the architect of a commercial CMP would like to pay a large portion of chip area (say, 20%) just for a DFD functionality, since the area of a chip significantly affects its price and marketing.
Few hardware-assisted schemes meet all aforesaid requirements simultaneously. Many previous schemes still introduce considerable performance loss, or modify critical components of a CMP (e.g., processor core, L2 cache, cache coherence protocol, and interconnection network). Besides, the logs of many previous schemes are too large to be transferred with a common debugging I/O. In addition, many previous schemes are evaluated on CMP simulators with the sequential consistency model, which is not adopted by any state-of-the-art CMP. Admittedly, the industrial DFD guidelines are somewhat too stringent for hardware-assisted deterministic replay. In industry, it is still possible for designers to implement hardware-assisted deterministic replay with some relaxation of these guidelines, in order to address the ever-increasing demand for parallel programming/debugging convenience. However, to bridge the gap between academia and industry, it is quite meaningful to take the DFD guidelines on board as much as possible when proposing a hardware-assisted deterministic replay scheme.
Our Idea
In this article, we propose a novel hardware-assisted deterministic replay scheme to meet the requirements of DFD. The key innovation is that, in the presence of a global clock, we can cost-effectively record the pending period information of instructions (for replaying memory orderings) without interfering with the normal functionality of a CMP. Here, the pending period of an instruction, defined by Chen et al. [2009a, 2009b], is a time interval on the global clock in which the instruction starts and ends 1. Naturally, two instructions with disjoint pending periods can be ordered by the so-called physical time order 2 [Chen et al. 2009a, 2009b]. It has been proven that execution orders cannot violate the physical time orders and pending period information [Chen et al. 2009a, 2009b], which allows us to infer execution orders from the recorded pending period information.
1 Meanwhile, the performing time of an instruction, which is the time when a memory instruction is observed by all processors, must be in the pending period of the instruction.
An illustrative example of inferring execution orders from pending period information is presented in Figure 1. Two processors, $P_1$ and $P_2$, execute instructions $u$ and $v$ (with subscripts), respectively. On the global clock, all instructions in the upper block finish before time 200, and all instructions in the lower block start after time 220. According to the definition of physical time order, any instruction in the upper block must precede any instruction in the lower block in physical time order. Consequently, among the execution orders represented by the arrows in Figure 1, most are inferrable from physical time orders, except the execution order $u_{126} \xrightarrow{E} v_{214}$ (the bold arrow in Figure 1). According to our experiments, about 99% of execution orders are inferrable from the pending period information.
After indirectly recording most execution orders through recording the pending period information, the remaining problem is how to identify and record the residual (about 1%) noninferrable execution orders. Different from the conference version of this study, which records all noninferrable execution orders, this extended version proposes a simple yet effective technique, named the direction prediction technique, to record the noninferrable execution orders with an even smaller log size. This technique is inspired by the key observation that the directions of execution orders are often predictable. Hence, we use a default direction as the predicted direction of a noninferrable execution order. In production run, only those noninferrable execution orders contradicting the default need to be recorded. In replay run, we replay instruction blocks according to the default orders until a recorded nondefault noninferrable execution order is met.
Benefiting from the direction prediction technique, which significantly reduces the log size of noninferrable execution orders, we can employ low-cost techniques (e.g., a longer sample period, Bloom-filter-based identification [Bloom 1970]) to approximately identify noninferrable execution orders. At the cost of introducing some false positives for noninferrable execution orders, the area cost of LReplay can be reduced by 60% in comparison with the conference version of this article.
By integrating the preceding techniques, we propose a hardware-assisted deterministic replay scheme called LReplay, which has been implemented and evaluated on the RTL design of a commercial CMP, Godson-3 [Hu et al. 2009]. For the existing components of the CMP, LReplay requires adding only a few registers for observing the state of the core, while keeping other components (e.g., L2 cache, memory controller, cache coherence protocol, and network-on-chip) unmodified. Such trivial modifications incur neither performance penalty nor regressive verification cost. The experimental results show that LReplay needs about 0.05Byte per Kilo-Instruction (B/K-Inst) for its Pending Period Log (PPL), and 0.12B/K-Inst for its Noninferrable Execution order Log (NEL). Under sequential consistency, the overall log size of LReplay over SPLASH-2 benchmarks (0.17B/K-Inst) is not only an order of magnitude smaller than that of previous deterministic replay schemes, while incurring no performance loss in production run, but also 70% smaller than the log size of the original version of LReplay (0.55B/K-Inst).
Furthermore, the conference version of LReplay is the first attempt to evaluate hardware-assisted deterministic replay on a system with a relaxed memory consistency model (i.e., Godson-3 consistency). To record the additional information for the relaxed consistency model, LReplay needs an additional 0.40B/K-Inst to record the load instructions violating sequential consistency in the Memory Consistency Log (MCL). In summary, the overall log size of LReplay over SPLASH-2 benchmarks is 0.57B/K-Inst for Godson-3 consistency.
Our Contributions
The main contributions of LReplay can be summarized as follows.
-As the first deterministic replay scheme built upon the global clock, LReplay can be easily implemented on different industrial designs, since it can support different memory consistency models, and the required global clock is a common component implemented in most (if not all) state-of-the-art CMPs. Benefiting from the pending period information infused by the global clock, the log size of LReplay is an order of magnitude smaller than that of previous deterministic replay schemes, while incurring no performance loss in production run.
-We propose the direction prediction technique to further prune the log spent on recording noninferrable execution orders, and the log size of LReplay is further reduced by 70% (from 0.55B/K-Inst to 0.17B/K-Inst) compared with the conference version of LReplay.
-LReplay is a hardware-assisted deterministic replay scheme fulfilling the industrial DFD guidelines. It has no performance impact on normal functionality, and is fully decoupled from normal functionality; its log is small enough to suit a common debugging I/O; it is implemented and evaluated on a system with a relaxed memory consistency model; its area cost is only 0.5%. These notable features show the potential of integrating hardware-assisted deterministic replay into future industrial processors.
The rest of the article is organized as follows. Section 2 introduces the theoretical basis of LReplay including global clock, pending period, physical time order, and so on. Section 3 presents the implementation details of LReplay. Section 4 evaluates LReplay via an experimental study. Section 5 briefly reviews some related work. Section 6 concludes the whole article.
THEORETICAL BASIS
The logical clock was proposed by Lamport to order events in distributed systems without physical global clocks [Lamport 1978]. Over the past decades, multiprocessor systems were often treated as variations of distributed systems. Hence, previous investigations on multiprocessor systems inherited Lamport's logical clock and logical time order for distributed systems. Admittedly, when the logical time order information is perfect, the logical behavior of a multiprocessor system can be determined. However, under many circumstances, it is both hard and ineffective to directly observe and record all the logical time orders, while conjecturing and inferring the unrecorded logical time orders may face high time complexities [Gibbons and Korach 1994; Netzer and Miller 1990]. Fortunately, due to the emergence of CMP techniques, the physical distance between processor cores is continually scaling down, and has become short enough to allow the existence of a physical global clock with tolerable inaccuracy. In industry, most (if not all) CMPs, including Westmere-EX of Intel [Sawant et al. 2011], Opteron of AMD [Dorsey et al. 2007], SPARC of SUN [Nawathe et al. 2007], POWER7 of IBM [Wendel et al. 2011], and Godson-3 of ICT, are equipped with a global clock to facilitate their physical designs. Even for many-core processors, a global clock with < 50ns inaccuracy is still applicable, since the chip size would be less than 1000mm^2 due to the limitation of package and power, and the clock signal can be transmitted faster than 20mm/ns. As a piece of evidence, the 80-core Polaris of Intel [Vangal et al. 2007] adopts a global clock. In academia, researchers also realize the importance of the global clock in solving parallel problems [Chen et al. 2009a, 2009b; Goodstein et al. 2010; Hu et al. 2012].
In the presence of a global clock, it is quite intuitive to order memory instructions by their performing times on the global clock. However, the precise performing time of an instruction is not only hard to observe (the global performing of an instruction may involve many cores), but also difficult to record (there may be billions of instructions performed in one second). To make the global clock practical for ordering events in multiprocessor systems, Chen et al. relaxed the precise performing time to a time interval named the pending period, which contains the performing time [Chen et al. 2009a, 2009b]. Consequently, two instructions with disjoint pending periods can be ordered by the sequence of their performing times. Such an order is called physical time order. Based on the aforesaid concepts, a series of techniques, collectively named pending period analysis, were proposed to analyze multiprocessor systems with global clocks [Chen et al. 2009a, 2009b].
In the remainder of this article, we assume that a global clock does exist in a CMP. We first briefly introduce the concepts of pending period and physical time order. After that, we introduce the relationship between physical time order and execution order to establish the theoretical basis of LReplay.
Preliminaries
Generally speaking, the concept of pending period can be traced to a common architecture concept called the instruction window. Before an instruction enters the instruction window of the corresponding processor, it cannot affect any other instruction. If an instruction has retired from the instruction window of a store-atomic processor 3, it must have been viewed by all the processors in the system. Hence, an instruction is globally performed when it is in the instruction window of the corresponding processor. Inspired by the instruction window, we assign a start time and an end time on the global clock as the relaxations of performing time for each instruction.
Definition 2.1 (Start and End Time of Instruction [Chen et al. 2009a, 2009b]). 4 For an instruction $u$ executed on a processor $P$, the start time $t_s(u)$ of $u$ is a time point on the global clock which is no later than the time point when $u$ enters the instruction window of $P$. The end time $t_e(u)$ of $u$ is a time point on the global clock that is no earlier than the time point when $u$ retires from the instruction window of $P$. The performing time $t_p(u)$ of $u$ is in the time interval between $t_s(u)$ and $t_e(u)$, that is, $t_s(u) \le t_p(u) \le t_e(u)$.
Based on the previous concepts of start time and end time, Chen et al. provided a definition of pending period.
Definition 2.2 (Pending Period [Chen et al. 2009a, 2009b]). The pending period of instruction $u$ is the time interval from its start time $t_s(u)$ to its end time $t_e(u)$.
Pending period information can be used to order instructions in many cases. Consider two instructions $u$ and $v$ with disjoint pending periods, where the pending period of $u$ is earlier than the pending period of $v$, that is, $t_e(u) < t_s(v)$ holds. It follows from Definition 2.1 that
$$t_p(u) \le t_e(u) < t_s(v) \le t_p(v).$$
Thus instruction $u$ is globally performed earlier than $v$. Such a partial order determined by the sequence of pending periods is called the physical time order (denoted by $\xrightarrow{T}$).
Definition 2.3 (Physical Time Order [Chen et al. 2009a, 2009b]). For two instructions $u$ and $v$, if the end time of instruction $u$ is before the start time of instruction $v$, then we say that $u$ is before $v$ in physical time order. Formally,
$$t_e(u) < t_s(v) \implies u \xrightarrow{T} v. \qquad (1)$$
Relationship between Physical Time Order and Execution Order
In this subsection, we investigate the relationship between physical time order and execution order. Before that, we briefly review the concept of execution order, which is a logical time order existing between two successive conflicting memory instructions. Here, we say that "two memory instructions are conflicting" if two instructions access the same memory location and at least one of them is a store instruction. The concrete definition of execution order is as given next.
Definition 2.4 (Execution Order).
We say that a store instruction $w$ is before instruction $u$ in execution order if and only if $w$ is the latest store instruction before $u$ that accesses the same location as $u$. We denote this as $w \xrightarrow{E} u$. We say that a store instruction $w$ is after instruction $u$ in execution order if and only if $w$ is the next store instruction after $u$ that accesses the same location as $u$. We denote this as $u \xrightarrow{E} w$.
The execution orders are crucial since the logical behavior of a multiprocessor system completely depends on them, that is, two executions of one parallel program with identical execution orders have identical results. Following this basic idea straightforwardly, traditional deterministic replay schemes record sufficient logical time orders in production run, and manage to reproduce all execution orders in replay run.
Unlike execution orders, which rely on traditional logical clocks, physical time orders are based on a distinct type of clock called the physical global clock. Chen et al. proved that the orders with respect to the two types of clocks are consistent [Chen et al. 2009a, 2009b]. As a corollary of their result, we have the following theorem.
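THEOREM 2.5 (INFERRING EXECUTION ORDER THEOREM). If $u$ and $v$ are successive conflicting instructions, and $u$ is before $v$ in physical time order, then $u$ is before $v$ in execution order. Formally, $u \xrightarrow{T} v \implies u \xrightarrow{E} v$, where $\xrightarrow{T}$ denotes physical time order and $\xrightarrow{E}$ denotes execution order.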
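PROOF. According to the definition of physical time order, $t_e(u) < t_s(v)$, so $u$ is globally performed before $v$. Since execution orders cannot violate physical time orders [Chen et al. 2009a, 2009b], the execution order between the successive conflicting instructions $u$ and $v$ must be from $u$ to $v$.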
Theorem 2.5 is the theoretical basis of our work. According to Theorem 2.5, from physical time orders (resulting from the pending period information) we can infer some execution orders. Consequently, in production run we can record physical time orders instead of many execution orders. Enforcing the recorded physical time orders in the replay run can indirectly guarantee the execution orders that are inferrable from physical time orders. Only the residual execution orders, which are noninferrable from physical time orders, need to be recorded directly in production run.
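The way Theorem 2.5 splits execution orders into an inferrable and a residual set can be sketched as follows. This is only an illustration under stated assumptions: the `Instr` record, the function names, and the pending period values are hypothetical (loosely modelled on the Figure 1 example), not taken from the LReplay implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    t_start: int  # t_s: start of the pending period on the global clock
    t_end: int    # t_e: end of the pending period on the global clock


def physically_before(u: Instr, v: Instr) -> bool:
    """u precedes v in physical time order when t_e(u) < t_s(v)."""
    return u.t_end < v.t_start


def split_execution_orders(orders):
    """Partition (u, v) execution orders into inferrable and noninferrable ones."""
    inferrable, noninferrable = [], []
    for u, v in orders:
        target = inferrable if physically_before(u, v) else noninferrable
        target.append((u, v))
    return inferrable, noninferrable


# Hypothetical pending periods (assumed values, for illustration only).
u120 = Instr("u120", 100, 190)
u126 = Instr("u126", 205, 230)
v210 = Instr("v210", 225, 260)
v214 = Instr("v214", 215, 240)

inferrable, noninferrable = split_execution_orders([(u120, v210), (u126, v214)])
print([(a.name, b.name) for a, b in inferrable])
print([(a.name, b.name) for a, b in noninferrable])
```

Only the pairs in `noninferrable` would have to be logged directly; the others are implied by the recorded pending periods.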
IMPLEMENTATION OF LREPLAY
Traditional hardware-assisted deterministic replay schemes often implant their hardware support into critical components of a CMP (e.g., processor core, L2 cache, cache coherence protocol, and interconnection network) to identify and record logical time orders. They bring nonnegligible design, verification, and performance costs, thus can hardly fulfill the industrial DFD guidelines.
In the presence of an on-chip global clock, we now propose a deterministic replay scheme called LReplay, which is a hardware-assisted scheme following the DFD guidelines. LReplay is implemented based on the RTL design of a commercial CMP, Godson-3 [Hu et al. 2009], whose main features are listed in Table I. The key innovation of LReplay is that it records the pending period information of each instruction block 5 without interfering with the normal functionality of the CMP. Since most execution orders can be deduced from the physical time orders infused by the pending period information, LReplay only needs to directly record the residual noninferrable execution orders. Moreover, the direction prediction technique employed by LReplay can further reduce the log size for recording noninferrable execution orders.
In the rest of the section, we first show how to effectively record pending period information. We then introduce a conservative yet effective way of dealing with the noninferrable execution orders, followed by the way of recording the additional log for relaxed memory consistency models (such as Godson-3 consistency [Chen et al. 2009a, 2009b]). Next, we describe how to replay with the log recorded by LReplay. Finally, we present the hardware support of LReplay, and explain how LReplay works with the hardware support.
Recording Pending Period
LReplay is a deterministic replay scheme based on pending period information. In this subsection, the tangible way of recording the pending period information is presented.
Intuitive Idea: A Pending-Period-Tight Approach.
An intuitive idea is to record the pending period of each memory instruction as tightly as possible, which allows almost all execution orders to be inferred. However, this idea is far from perfect due to its large log size. More specifically, in a modern CMP executing more than $10^{10}$ instructions per second, if it takes 8B (bytes) to record the start time and end time of each instruction, the log size of the pending period will exceed 80GB/s, which is quite a challenge. As a result, we must find some other effective way of recording the pending period information.
Practical Method: Relaxation of Pending Period.
To reduce the log size of recording pending period information, LReplay records the pending period of each instruction block instead of each individual instruction. One way of obtaining the pending period of an instruction block is to sample the precise value of the memory instruction counter periodically. For example, considering a sequentially consistent system, we sample the memory instruction counter of a processor every 100 cycles. If the sampled value of the memory instruction counter is 50 at time 100 and 120 at time 200, the pending period of the instruction block from memory instruction 50 to memory instruction 120 can be assigned to the time interval [100, 200]. However, this is still inefficient since it consumes several bytes per sample period. In fact, LReplay periodically records the approximate increment of the memory instruction counter to further reduce the log size for pending period information, as long as the deviation between the obtained information and the actual value of the memory instruction counter is bounded.
Concretely, as shown in Algorithm 1, the pending period sampler module of LReplay records the approximate increment of the memory instruction counter at the $j$th sampling for processor $P_i$ as $inc_{i,j}$ in the Pending Period Log (PPL). Register mem_inst_cnt is the memory instruction counter of $P_i$, which is increased by 1 whenever a memory instruction commits. Register mem_cnt_div is sampled every sample_period cycles to record the quotient of mem_inst_cnt divided by sample_period. Register pre_mem_cnt_div holds the value of mem_cnt_div in the previous sampling (the $(j-1)$th sampling). As a result, the value of the approximate increment $inc_{i,j}$ can be obtained through mem_cnt_div − pre_mem_cnt_div. Then, $inc_{i,j}$ is recorded in PPL with Huffman encoding.
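The register updates described above can be mirrored in a few lines of software. The following is a minimal sketch for a single processor, assuming the counter value at each sampling is given as a list; the function name is hypothetical, and the Huffman encoding step is omitted. The sample values are those reported for the Figure 2 example.

```python
def record_increments(counter_samples, sample_period):
    """Return the approximate increments inc_{i,j} for one processor.

    counter_samples[j-1] is the value of mem_inst_cnt at the j-th sampling.
    """
    log = []
    pre_mem_cnt_div = 0  # quotient seen at the previous sampling
    for mem_inst_cnt in counter_samples:
        mem_cnt_div = mem_inst_cnt // sample_period
        log.append(mem_cnt_div - pre_mem_cnt_div)
        pre_mem_cnt_div = mem_cnt_div
    return log


samples = [165, 278, 345, 510, 700, 828, 974, 1105, 1312, 1453]
ppl = record_increments(samples, 256)
print("".join(str(inc) for inc in ppl))  # prints 0100110110
```

Because only the change of the quotient is stored, each entry is a small integer, which is what makes the subsequent entropy coding effective.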
In fact, the PPL recorded by LReplay provides an underestimated approximation of the memory instruction count (i.e., rounding mem_inst_cnt downward to a multiple of sample_period). As a result, we can derive a bound on the value of the memory instruction counter on $P_i$ at the $j$th sampling. An example of recording the approximate increment of the memory instruction counter is illustrated in Figure 2, where the memory instruction counter is sampled every 256 cycles for 10 times, that is, there are 256 memory instructions in each instruction block. The values of the memory instruction counter (mem_inst_cnt in the figure) are 165, 278, 345, 510, 700, 828, 974, 1105, 1312, and 1453 for the 10 samplings, respectively, while the values of the memory instruction counter rounded downward by 256 (mem_cnt_rnd in the figure) are 0, 256, 256, 256, 512, 768, 768, 1024, 1280, and 1280 for the 10 samplings, respectively. Therefore, we only need to record a log of 0100110110 for the 10 samplings. Moreover, recovering an approximate memory instruction counter from the log is rather straightforward. For example, the mem_inst_cnt for the 7th sampling should be in the interval [256(0+1+0+0+1+1+0), 256(0+1+0+0+1+1+0) + 255], that is, [768, 1023]; the mem_inst_cnt for the 9th sampling should be in the interval [256(0+1+0+0+1+1+0+1+1), 256(0+1+0+0+1+1+0+1+1) + 255], that is, [1280, 1535]. Hence, memory instruction 1024 has not started at the 7th sampling, and memory instruction 1279 has ended before the 9th sampling. Thus it can be concluded that the instruction block from memory instruction 1024 to memory instruction 1279 is in the pending period of [1792, 2303], which is the time interval from the 7th sampling to the 9th sampling.
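The recovery step in this example can be reproduced with a short sketch. The function names are hypothetical, and the convention of closing the interval one cycle before the ending sampling is an assumption chosen to match the [1792, 2303] interval quoted above.

```python
from itertools import accumulate


def counter_bounds(ppl, sample_period):
    """For each sampling, return (low, high) bounds on mem_inst_cnt."""
    return [(sample_period * s, sample_period * s + sample_period - 1)
            for s in accumulate(ppl)]


def block_pending_period(ppl, sample_period, first, last):
    """Pending period of the block of memory instructions first..last.

    Start: latest sampling at which instruction `first` cannot have started.
    End: earliest sampling at which instruction `last` must have ended.
    Returns None if the log is too short to bound the end of the block.
    """
    bounds = counter_bounds(ppl, sample_period)
    start_j = max((j for j, (_, high) in enumerate(bounds, 1) if high < first),
                  default=0)
    end_j = next((j for j, (low, _) in enumerate(bounds, 1) if low > last), None)
    if end_j is None:
        return None
    return (start_j * sample_period, end_j * sample_period - 1)


ppl = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
bounds = counter_bounds(ppl, 256)
print(bounds[6])  # 7th sampling: (768, 1023)
print(bounds[8])  # 9th sampling: (1280, 1535)
print(block_pending_period(ppl, 256, 1024, 1279))  # (1792, 2303)
```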
Algorithm 1 records the pending period information with less than $k$ bits per processor per sample period, where $k$ is the average number of memory instructions committed per processor per cycle. Obviously, when the sample period is long enough, the PPL is quite small. However, a longer sample period results in a larger number of noninferrable execution orders (since a longer sample period leads to further relaxation of the pending period). As a consequence, the size of the Noninferrable Execution order Log (NEL) will increase. Furthermore, a longer sample period requires more hardware to identify the noninferrable execution orders. In this article, we find that a 1024-cycle sample period is an appropriate trade-off between PPL size and NEL size. In this case, when the IPC is about 1.25, the size of the $inc_{i,j}$ information is only about 0.10B/K-Inst. With Huffman encoding of the $inc_{i,j}$ information, the log size of PPL can be compressed further to less than 0.05B/K-Inst.
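As a rough sanity check of the figure of about 0.10B/K-Inst, the arithmetic can be written out as below. The assumption that the unencoded log costs about one bit per processor per sample period is ours, made only for illustration; the helper name is hypothetical.

```python
def raw_ppl_bytes_per_kinst(bits_per_sample, sample_period_cycles, ipc):
    """Unencoded PPL size in bytes per thousand instructions."""
    insts_per_sample = sample_period_cycles * ipc
    return (bits_per_sample / 8) / (insts_per_sample / 1000)


# Assumed: 1 bit per sample, 1024-cycle sample period, IPC of 1.25.
print(round(raw_ppl_bytes_per_kinst(1, 1024, 1.25), 3))  # prints 0.098
```

Under these assumptions the estimate lands close to the quoted value, before any Huffman encoding is applied.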
Recording Noninferrable Execution Orders
Although the recorded pending period information can infer most execution orders, there are still some (around 1%) noninferrable execution orders existing between instructions without physical time order. It is crucial for LReplay to identify and record directly these noninferrable execution orders in production run.
At each time point when a memory instruction (say, $u$) misses in the L1 dcache, LReplay identifies whether there is a noninferrable execution order involving $u$. If there is an execution order from some store instruction $w$ to $u$ ($w \xrightarrow{E} u$) 6, in order to check whether this execution order is inferrable from the pending period information, LReplay inspects whether the pending period of $w$ overlaps with that of $u$. In a conservative manner, if the memory address of $u$ does not hit the address of any store instruction in the recent sample period, then $w$ and $u$ do not have overlapping pending periods, and it can be decided that $w \xrightarrow{E} u$ is inferrable from the pending period information. Otherwise, the pending period of $w$ may overlap with that of $u$, and the order $w \xrightarrow{E} u$ is treated as noninferrable.
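The conservative check can be pictured with an exact address set standing in for the hardware lookup structure. This is a simplified sketch: the class and method names are hypothetical, and clearing the whole set at every sample boundary is an assumption made for brevity rather than the window management used by LReplay.

```python
class RecentStoreWindow:
    """Exact record of store addresses seen in the recent sample period."""

    def __init__(self):
        self._store_addrs = set()

    def record_store(self, addr: int) -> None:
        self._store_addrs.add(addr)

    def start_new_sample_period(self) -> None:
        # Simplification (assumed): forget all addresses at the boundary.
        self._store_addrs.clear()

    def is_noninferrable(self, miss_addr: int) -> bool:
        """Check performed when a memory instruction misses in the L1 dcache.

        A hit means the pending periods may overlap, so the execution order
        is conservatively treated as noninferrable.
        """
        return miss_addr in self._store_addrs


window = RecentStoreWindow()
window.record_store(0x80)
print(window.is_noninferrable(0x80))  # True: must be logged directly
print(window.is_noninferrable(0xC0))  # False: inferrable from pending periods
```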
Identifying Noninferrable Execution Orders Based on Bloom Filter.
In our preliminary work, we employ a CAM to record the store addresses in the current sample period for each processor core, and look up the address of $u$ in the CAMs to check the inferrability of $w \xrightarrow{E} u$. However, when we use a 1024-cycle sample period, LReplay requires a 2048-entry CAM, which introduces significant timing, power, and area costs. Hence, instead of employing a CAM, here we employ a 2240-bit signature derived from a Bloom-filter-based hashing function [Bloom 1970] to save the addresses of the cache lines stored by the recent 2048 memory accesses. We can check the membership of $u$'s address in the encoded address set represented by the signature so as to identify noninferrable execution orders, at the cost of some false positives.
To guarantee a low false positive rate of the Bloom filter, we have to periodically clear the old addresses from the Bloom filter signature. Hence, we partition the 2240-bit signature into five groups, where each group contains 448 bits to save the store addresses of 512 memory accesses. The store addresses of new instructions are inserted into different groups iteratively. As the example illustrated in Figure 3 shows, before the 3.5K-th memory instruction comes, group4, group5, group1, and group2 contain the store addresses of the most recent 2048 instructions ([1.5K, 3.5K)). As a result, we can flush group3 to discard the store addresses of memory instructions [1K, 1.5K), since they are now out of date. Then, the store addresses of memory instructions [3.5K, 4K) are saved in group3. Before the 4K-th memory instruction comes, group5, group1, group2, and group3 contain the store addresses of the most recent 2048 instructions ([2K, 4K)). Consequently, we can flush group4 (which contains the store addresses of out-of-date memory instructions [1.5K, 2K)). In this way, the store addresses of the recent 2048 memory accesses are always kept within the five groups of signatures in a pseudo first-in-first-out manner. As a consequence, the false positive rate of the Bloom filter is acceptable for identifying noninferrable execution orders.
1:12 Y. Chen et al.
Fig. 3. The signature is divided into five groups. Before the 3.5K-th memory instruction comes, we flush all addresses in group3. Then, the store addresses of the 3.5K-th to (4K − 1)-th memory instructions are saved in group3.
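The rotation of the five groups can be modelled in software as follows. This is a behavioural sketch, not the RTL: the class name, the use of two hash functions, and the BLAKE2-based hashing are all assumptions made for illustration (hardware would use much cheaper hash logic). Only the group count, the 448 bits per group, and the 512 accesses per group come from the text.

```python
import hashlib


class GroupedSignature:
    """Five Bloom-filter groups used in a pseudo first-in-first-out manner."""

    def __init__(self, groups=5, bits_per_group=448,
                 accesses_per_group=512, num_hashes=2):
        self.bits_per_group = bits_per_group
        self.accesses_per_group = accesses_per_group
        self.num_hashes = num_hashes      # assumed value, at most 4 here
        self.groups = [0] * groups        # each group is a bit vector in an int
        self.current = 0                  # group currently being filled
        self.count = 0                    # accesses accounted to that group

    def _positions(self, line_addr):
        digest = hashlib.blake2b(line_addr.to_bytes(8, "little"),
                                 digest_size=16).digest()
        for k in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * k:4 * k + 4], "little")
            yield chunk % self.bits_per_group

    def access(self, line_addr, is_store):
        """Account for one memory access; remember its address if it is a store."""
        if self.count == self.accesses_per_group:
            # Move on to the oldest group and flush it before reuse.
            self.current = (self.current + 1) % len(self.groups)
            self.groups[self.current] = 0
            self.count = 0
        if is_store:
            for pos in self._positions(line_addr):
                self.groups[self.current] |= 1 << pos
        self.count += 1

    def may_contain(self, line_addr):
        """True if some group reports the address (false positives possible)."""
        positions = list(self._positions(line_addr))
        return any(all((group >> pos) & 1 for pos in positions)
                   for group in self.groups)


sig = GroupedSignature()
sig.access(0x1000, is_store=True)
print(sig.may_contain(0x1000))  # True
```

With this arrangement, the four groups other than the one just flushed always cover the most recent 2048 accesses, which matches the flushing order described for Figure 3.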
Pruning Noninferrable Execution Orders Using Direction Prediction.
LReplay needs to record the NonInferrable Execution Orders (NIEOs) after identifying them. Here we propose a simple yet effective direction prediction technique to reduce the log size for NIEOs. The key observation is that the directions of execution orders are often predictable. In production run, we designate a default direction as the prediction for the direction of an NIEO. Only the NIEOs contradicting the default should be recorded. In replay run, we can replay instructions (instruction blocks) according to the default orders until a recorded nondefault order is met.
There might be many rules that can greatly reduce the log size for recording NIEOs. However, to focus on the whole framework of LReplay rather than the detailed direction prediction rule, we simply follow Occam's razor [Occam's Razor 2013] and adopt a straightforward earlier-end-first rule:
-The default direction of each NIEO between two instruction blocks is set to be from the block with an earlier end time to the block with a later end time (in case the end times of the blocks are the same, we treat the instruction block with smaller processor id as the earlier) 7 .
An illustrative example of the direction prediction of NIEOs is presented in Figure 4. Consider that processor cores $P_1$ and $P_2$ are executing two instruction blocks, respectively, where the two blocks have the same pending periods. There is no physical time order between these instruction blocks, thus all execution orders between them are noninferrable. For the first execution order between them (between $u_0$ and $v_1$), we set the default direction to be from $P_1$ to $P_2$, since the instruction blocks have the same end times, and the id of $P_1$ is smaller than the id of $P_2$. The actual order $u_0 \xrightarrow{E} v_1$ matches the default direction, thus we do not need to record the execution order $u_0 \xrightarrow{E} v_1$ (though it is noninferrable given the pending period information). Similarly, for instruction pairs ($u_1$, $v_2$) and ($u_2$, $v_3$), we set their default directions to be from $P_1$ to $P_2$. The actual orders $u_1 \xrightarrow{E} v_2$ and $u_2 \xrightarrow{E} v_3$ match the default directions respectively, thus we do not need to record the two NIEOs. However, the default direction set for the subsequent order between $u_{20}$ and $v_{10}$ is still from $P_1$ to $P_2$, while the actual order does not match that. Hence, we must record the nondefault NIEO $v_{10} \xrightarrow{E} u_{20}$. The default direction between $u_{21}$ and $v_{21}$ is from $P_1$ to $P_2$, which matches the actual order. Hence, we do not need to record the NIEO. In summary, although there are five NIEOs between $P_1$ and $P_2$, four of them match the corresponding default directions (the bold dashed arrows in Figure 4). We only need to record one execution order, $v_{10} \xrightarrow{E} u_{20}$ (the bold solid arrow in Figure 4).

Once an NIEO is identified to have a nondefault direction, we save it in NEL, a dedicated log for NIEOs. The concrete recorded information of an NIEO includes the corresponding processor ids, and the increments of the related memory instruction counters (with respect to the previously recorded NIEO). As shown in Algorithm 2, LReplay uses 1-4 additional bits to show how many bits are required to record the increment of the memory instruction counter, which reduces the log size of NEL.

Fig. 4. The noninferrable execution orders between two processors $P_1$ and $P_2$. The bold dashed arrows represent the execution orders complying with the default, while the bold solid arrow represents the nondefault execution orders (i.e., whose directions do not match the default). Only the bold solid arrow should be recorded in production run.

7 In fact, we cannot decide whether an NIEO will match the default direction as soon as the NIEO is identified in production run, since the end times of the related instruction blocks are still unknown at that time. To address this problem, we save the newly identified NIEO in a small temporary buffer, and defer making the decision until either of the related blocks ends.
It is worth noting that LReplay does not need to directly record the predicted (default) directions, since they can be inferred from the pending period information recorded in PPL.
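The earlier-end-first rule can be written down in a few lines. The following is a minimal sketch, assuming that the end times of both blocks are already known (in hardware this decision is deferred, as footnote 7 explains) and that the two instructions run on different processors; the names `BlockEnd`, `default_source`, and `must_record` are hypothetical.

```python
from typing import NamedTuple


class BlockEnd(NamedTuple):
    proc: int       # processor id executing the block
    end_time: int   # end time of the block's pending period


def default_source(a: BlockEnd, b: BlockEnd) -> int:
    """Processor id of the block that the default direction starts from.

    The block with the earlier end time wins; ties go to the smaller id.
    """
    return min(a, b, key=lambda blk: (blk.end_time, blk.proc)).proc


def must_record(src: BlockEnd, dst: BlockEnd) -> bool:
    """True if the actual order src -> dst contradicts the default direction."""
    return default_source(src, dst) != src.proc
```

With two blocks that end at the same time on processors 1 and 2, an order from processor 1 to processor 2 matches the default and is not logged, whereas an order from processor 2 to processor 1 must be logged, as in the Figure 4 example.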
Tackling Relaxed Memory Consistency Models
LReplay can support various memory consistency models, including sequential consistency [Scheurich and Dubois 1987], processor consistency [Goodman 1989], and Godson-3 consistency [Chen et al. 2009a, 2009b]. For the relaxed memory consistency models, the instruction block size of LReplay should be reduced, and each load instruction that causes a violation of sequential consistency should be identified and recorded in a dedicated log named MCL.
For relaxed memory consistency models, the number of memory instructions in each instruction block should be reduced from sample_period to sample_period − 2 × memory_inst_window_size, where memory_inst_window_size is the size of the memory instruction window 8. The reason is that, although some consistency models allow a memory instruction to execute before prior memory instructions in program order, a memory instruction cannot execute before a prior memory instruction that has already retired from the memory instruction window. Hence, by removing the first and last memory_inst_window_size instructions from each block, we can guarantee that there is a firm physical time order between two consecutive blocks on the same core.
Let us recall the example given at the beginning of Section 3.1.2. We mentioned that the memory instruction counter of a processor is 50 at time 100 and 120 at time 200. For sequential consistency, the pending period of the instruction block from memory instruction 50 to memory instruction 120 is the time interval [100, 200]. However, for relaxed memory consistency models such as processor consistency and Godson-3 consistency, the time interval [100, 200] is the pending period of the instruction block from memory instruction 50 + 16 to memory instruction 120 − 16 (assuming that the size of the memory instruction window is 16). Therefore, any instruction that is no earlier than memory instruction 50 + 16 starts after time 100, and any instruction that is no later than memory instruction 120 − 16 has ended before time 200.
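The trimming arithmetic in this example is simple enough to check mechanically. The sketch below uses hypothetical helper names and only restates the rule from the text: drop one window's worth of memory instructions at each end of a block.

```python
def trimmed_block(start_counter, end_counter, window):
    """Instruction range whose pending period is the sampled time interval."""
    return start_counter + window, end_counter - window


def block_size(sample_period, window):
    """Memory instructions per block under a relaxed consistency model."""
    return sample_period - 2 * window
```

For the counters 50 and 120 with a 16-entry window, the trimmed block runs from memory instruction 66 to memory instruction 104.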
On the other hand, LReplay can identify the load instructions that may cause violations of sequential consistency, and record the results of these load instructions (this order-value-hybrid recording method was originally proposed in Xu et al. [2006] to cope with recording/replaying in relaxed memory consistency systems). Concretely, we need a small CAM and a small RAM to save the addresses and results of recent load instructions, respectively (the CAM and RAM should have memory_inst_window_size entries). When a store instruction w misses in L1 dcache, LReplay compares its address with the recorded addresses of the recent load instructions. If a recorded load instruction r has the same address as w, it may cause a violation of sequential consistency; therefore, we record the processor id, memory instruction counter, and result of r in the MCL log. To reduce the log size of MCL, we use a compression technique that is similar to Algorithm 2.
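The CAM/RAM pair can be modeled in software as one bounded FIFO of recent loads. This is a rough sketch under assumptions: the class and method names are hypothetical, one Python `deque` stands in for both the CAM (addresses) and the RAM (values), and the compression of MCL entries is left out.

```python
from collections import deque


class RecentLoads:
    """Per-core record of the most recent loads (illustrative model)."""

    def __init__(self, proc_id, window=16):
        self.proc_id = proc_id
        # A bounded deque drops the oldest entry automatically (FIFO).
        self.entries = deque(maxlen=window)

    def on_load(self, counter, addr, value):
        """Register a load: memory instruction counter, address, loaded value."""
        self.entries.append((counter, addr, value))

    def on_remote_store_miss(self, addr):
        """Return MCL candidates (proc id, counter, value) matching a store address."""
        return [(self.proc_id, c, v) for c, a, v in self.entries if a == addr]
```

A store miss on another core is looked up with `on_remote_store_miss`; every returned tuple corresponds to a load whose result would be written to the MCL log.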
Replaying with PPL, NEL, and MCL Log
We have developed an Instruction Set Simulator (ISS) Replayer to replay the recorded PPL (the log for pending period), NEL (the log for noninferrable execution order), and MCL (the additional log for relaxed memory consistency models). It replays the recorded instructions in a block manner, where each instruction block contains 1024 consecutive memory instructions on the same processor for sequential consistency, or 992 (i.e., 1024 − 2 × 16) memory instructions for relaxed consistency models. The main idea of the ISS Replayer is to replay instruction blocks with disjoint pending periods according to the physical time orders, and instruction blocks with overlapping pending periods according to the predicted default directions. As shown in Algorithm 3, the ISS Replayer starts by calling the select_block() function to select an instruction block B to replay. By default, select_block() selects the block B that has the earliest end time (and the smallest processor id) among all unfinished blocks according to the PPL log. After selecting B, the ISS Replayer calls the replay_block() function to replay the selected block B. If an instruction u in block B is after some memory instruction v according to a nondefault noninferrable execution order ($v \xrightarrow{E} u$) recorded in the NEL log, the ISS Replayer will temporarily switch to executing block(v) (i.e., the instruction block containing v). Once v is executed, the ISS Replayer will switch back to execute the rest of B. To replay the execution under relaxed memory consistency models, the ISS Replayer adopts a technique similar to the order-value-hybrid method of RTR [Xu et al. 2006]: if u is recorded in MCL for possibly causing a violation of sequential consistency, the result of u is directly given by the corresponding value recorded in MCL. Once all instructions in block B have been executed, the ISS Replayer will mark B as a finished block, and select another unfinished instruction block to continue.

Algorithm 3 (excerpt): … (end_time(B′) < end_time(B)) || (end_time(B′) == end_time(B) && proc(B′) < proc(B)) then …

Here, each switch between instruction blocks occurs either because a block is finished or because a recorded nondefault noninferrable execution order is met.
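The block-selection and switching behavior described above can be sketched as a toy scheduler. This is not the ISS Replayer itself: instructions are reduced to labels, the MCL value substitution is omitted, and the sketch assumes that the NEL is acyclic and that every logged predecessor v lives on a different core than u. The `Block` type and the `replay` function are hypothetical names; `select_block` follows the rule stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Block:
    proc: int        # processor id
    end_time: int    # end of the pending period, taken from the PPL log
    instrs: list     # instruction labels in program order
    pos: int = 0     # index of the next instruction to replay


def select_block(blocks):
    """Pick the unfinished block with the earliest end time, then smallest id."""
    unfinished = [b for b in blocks if b.pos < len(b.instrs)]
    return min(unfinished, key=lambda b: (b.end_time, b.proc), default=None)


def replay(blocks, nel):
    """Return the replayed instruction order.

    nel maps an instruction u to the instruction v that must execute first,
    i.e. one recorded nondefault order v -> u per entry (a simplification).
    """
    owner = {ins: b for b in blocks for ins in b.instrs}
    done, trace = set(), []

    def run(block, stop_at=None):
        while block.pos < len(block.instrs):
            u = block.instrs[block.pos]
            v = nel.get(u)
            if v is not None and v not in done:
                run(owner[v], stop_at=v)   # temporary switch to block(v)
            block.pos += 1
            done.add(u)
            trace.append(u)
            if u == stop_at:
                return                      # switch back to the interrupted block

    while (b := select_block(blocks)) is not None:
        run(b)
    return trace
```

The recursion in `run` mirrors the temporary switch: the replayer leaves B only long enough to execute v and everything before v in its block, then resumes B.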
An example of replaying is shown in Figure 5. The instruction blocks $B_1$ and $B_2$ are on processor $P_1$, while the instruction blocks $B_3$, $B_4$, and $B_5$ are on processor $P_2$. Instruction block $B_1$ is before instruction blocks $B_4$ and $B_5$ in physical time order, instruction block $B_2$ is before instruction block … Concretely, the ISS Replayer first selects block $B_1$ to execute, since $B_1$ has the earliest end time and the smallest processor id among all unfinished instruction blocks. When block $B_1$ has been executed, the ISS Replayer switches (the 1st switch) to block $B_3$, since $B_3$ has the earliest end time among all the unfinished instruction blocks.

The ISS Replayer for LReplay cannot replay as fast as the production run (like some previous investigations [Hower and Hill 2008]). While the replay speed may not be a main issue for many important applications of deterministic replay (especially for debugging), we agree that the speed of the replayer may still be important for some other applications. Hence, improving the speed of the replayer is left as future work.
Hardware Support of LReplay
In this subsection, we introduce the modification on Godson-3 RTL for supporting LReplay. For existing components of Godson-3, LReplay only adds a few registers in the processor core to facilitate observability of the memory instruction information, while keeping other existing components such as L2 cache, memory controller, cache coherence protocol, and switch+mesh interconnection unmodified. The major modification required by LReplay is adding a Log Processing Unit (LPU), the backbone of the hardware support for LReplay. LPU collects the memory instruction information by a dedicated and simple network with the star topology (note that a dedicated network is very common for DFD functionalities [Abramovici et al. 2008] ). The gathered information is processed by LPU to generate the logs, which can be shifted out through JTAG, a well-known DFD port in industry.
The detailed structure of LPU is illustrated in Figure 6. LPU includes several grey and green blocks. The grey blocks (address signature, prediction register, NIEO buffer, PPL recording, and NEL recording) alone can support deterministic replay in a sequentially consistent system, while the green blocks (MCL recording and the corresponding CAM/RAM) together with the grey blocks can support deterministic replay in a system with a relaxed memory consistency model.
Concretely, LPU has a 2240-bit address signature for each processor core to save the store addresses of the recent 2048 memory instructions. To periodically flush out-of-date store addresses from the signature, the whole signature is partitioned into five groups, where each group uses 448 bits to save the store addresses of 512 memory instructions. For every 512 memory instructions, the group containing the oldest addresses is completely flushed, and begins to accept the new store addresses. As a result, the signature always maintains a superset of the store addresses of the recent 2048 memory instructions in a pseudo first-in-first-out manner.

Fig. 6. Hardware support needed by LReplay. The white blocks represent the existing components of the CMP, the grey blocks represent the hardware support for deterministic replay in a sequentially consistent system, and the green blocks represent the additional hardware support for deterministic replay under a relaxed memory consistency model.
Moreover, LPU maintains a p-bit prediction register per core (p is the number of cores in the system). The i-th bit of the prediction register for core $P_j$ represents whether the most recent block of core $P_i$ ends earlier than the current block of core $P_j$ (1 means the most recent block of $P_i$ ends earlier, and 0 means the most recent block of $P_i$ ends later). This bit is set to 1 at the end of each block of $P_i$, and cleared to 0 at the end of each block of $P_j$. The prediction registers help specify the default direction (from the block with an earlier end time to the block with a later end time) for each noninferrable execution order.
There is also a 16-entry NIEO buffer for each core to temporarily save those identified yet unrecorded NIEOs found in the current instruction block. The role of the NIEO buffer is twofold. First, the NIEO buffer removes nonexistent NIEOs. More specifically, LPU identifies an NIEO $u \xrightarrow{E} v$ when the memory instruction v misses in L1 dcache. However, v may be a speculative instruction that will finally be canceled, which means that the related NIEO does not exist. The NIEO buffer avoids recording such nonexistent NIEOs by maintaining a state for each NIEO entry. When an NIEO is identified at the cache miss stage of v, the state of the NIEO entry is possible. The state is set to committed when v really commits. An uncommitted NIEO can be simply discarded at the end of the corresponding instruction block.
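The possible/committed life cycle of a buffer entry can be illustrated with a short model. It is a sketch only: the names are hypothetical, entries are keyed by an arbitrary tag for the instruction v, and the 16-entry capacity and overflow handling are deliberately left out.

```python
class NIEOBuffer:
    """Per-core buffer of identified but not yet recorded NIEOs (illustrative)."""

    def __init__(self):
        self.entries = {}   # tag of v -> {"src", "dst", "state"}

    def identify(self, tag, src_core, dst_core):
        """Called when v misses in the L1 dcache and hits a remote signature."""
        self.entries[tag] = {"src": src_core, "dst": dst_core, "state": "possible"}

    def commit(self, tag):
        """Called when v actually commits; canceled instructions never get here."""
        if tag in self.entries:
            self.entries[tag]["state"] = "committed"

    def end_block(self):
        """Return committed NIEOs as (src, dst) pairs and drop everything else."""
        committed = [(e["src"], e["dst"])
                     for e in self.entries.values() if e["state"] == "committed"]
        self.entries.clear()
        return committed
```

Entries that are still in the possible state when the block ends belong to canceled speculative instructions and simply disappear.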
Second, the NIEO buffer filters NIEOs matching the default directions, thus reducing the number of directly recorded NIEOs. Consider a committed NIEO $u \xrightarrow{E} v$, where u and v are executed on $P_i$ and $P_j$, respectively. When a block containing either u or v has ended, LPU checks the i-th bit of $P_j$'s prediction register to find out the predicted direction. If the bit is 1, which means that the predicted direction is from $P_i$ to $P_j$, then the NIEO $u \xrightarrow{E} v$ matches the default direction. In this case, we do not need to record the NIEO, and the buffer simply discards it. If the bit is 0, which means that the default direction is from $P_j$ to $P_i$, then the NIEO $u \xrightarrow{E} v$ does not match the default direction. In this case, LPU records the NIEO in the NEL log 9.
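The interplay between the prediction registers and this check can be sketched as follows. The model assumes that block-end events are processed one at a time (simultaneous ends, which the text resolves by processor id, are not modeled), and the names `PredictionRegisters`, `default_is_from`, and `must_record` are hypothetical.

```python
class PredictionRegisters:
    """One p-bit register per core, stored as Python ints (illustrative)."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.regs = [0] * num_cores

    def on_block_end(self, core):
        """A block of `core` ends: set its bit elsewhere, clear its own register."""
        for j in range(self.num_cores):
            if j != core:
                self.regs[j] |= 1 << core
        self.regs[core] = 0

    def default_is_from(self, i, j):
        """True if bit i of core j's register is 1 (default direction i -> j)."""
        return bool((self.regs[j] >> i) & 1)


def must_record(regs, src_core, dst_core):
    """A committed NIEO src -> dst is logged only if it contradicts the default."""
    return not regs.default_is_from(src_core, dst_core)
```

After a block on core 0 ends while core 1 is still running its block, orders from core 0 to core 1 follow the default and are dropped, while an order from core 1 to core 0 is written to NEL.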
To support relaxed memory consistency models, LPU needs some additional hardware support (green blocks in Figure 6). For each processor core, the values and addresses of the 16 most recent load instructions (the same as the size of the memory instruction window) are gathered and saved in a RAM and a CAM, respectively. Both the CAM and the RAM adopt a First-In-First-Out (FIFO) scheme. When a new load instruction needs to be registered, the oldest entry in the CAM (and RAM) is discarded to make room for the new memory instruction. Since the CAM in LPU is very small (16 entries), it does not have a large impact on the overall power consumption of the CMP.
How Does LPU Operate?
Currently, industrial deterministic replay tools such as GDB [2013] have already been able to deal with single-threaded nondeterministic factors, such as uncertain instruction, interrupt, and exception. Hence, LPU focuses on dealing with the orders between user mode memory instructions from different threads (here we assume the memory races between system calls can be determined by well-written system software).
In the normal execution of the CMP, LPU is power-gated to avoid wasting power. Hence, LPU does not affect the booting of the Operating System (OS). When we need to record the order information of a parallel application, the OS wakes up LPU and maintains a watch list of cores for LPU. If a core in the watch list is in user mode, it will be inspected by LPU. Those cores which are in the system mode or not in the watch list will be simply ignored by LPU 10 .
Concretely, when a core (say, $P_1$) calls the switch_to() function to execute a thread, the OS checks whether the thread belongs to an application that needs to be recorded. If the thread should be recorded, the OS configures LPU to add $P_1$ to the watch list of LPU. Otherwise, the OS configures LPU to remove $P_1$ from the watch list of LPU. Once $P_1$ is in the watch list, LPU will inspect $P_1$ whenever $P_1$ switches from system mode to user mode.
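The watch-list policy amounts to a two-condition gate, which the following sketch makes explicit. It is a simplified software model with hypothetical names; the real mechanism is OS configuration of LPU registers, which is not modeled here.

```python
class WatchList:
    """Tracks which cores LPU should inspect (illustrative model)."""

    def __init__(self):
        self.watched = set()     # cores running a thread that must be recorded
        self.user_mode = set()   # cores currently executing in user mode

    def on_switch_to(self, core, thread_is_recorded):
        """OS hook at a context switch: add or remove the core."""
        if thread_is_recorded:
            self.watched.add(core)
        else:
            self.watched.discard(core)

    def on_mode_change(self, core, user_mode):
        """Track transitions between system mode and user mode."""
        if user_mode:
            self.user_mode.add(core)
        else:
            self.user_mode.discard(core)

    def is_inspected(self, core):
        """A core is inspected only if it is watched and in user mode."""
        return core in self.watched and core in self.user_mode
```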
When $P_1$ is both in the watch list and in user mode, LPU inspects $P_1$ as follows.
-For every sample period, LPU samples the memory instruction counter of $P_1$, and processes the values of memory instruction counters by Algorithm 1 to generate the PPL log.
-When a store instruction of $P_1$ writes L1 dcache, LPU saves the address of its accessed cache line in the corresponding Bloom filter signature.
-When a memory instruction of $P_1$ misses in L1 dcache, LPU compares its address with the Bloom filter signatures of other cores to identify whether there is an NIEO. If the address hits in some signature, LPU finds a possible NIEO, and immediately saves the information of the possible NIEO (including the related processor ids and memory instruction counters) in the NIEO buffer.
-When the related memory instruction of a possible NIEO commits in $P_1$, LPU changes the state of the NIEO to committed.
-When the current instruction block of $P_1$ ends, LPU checks whether each related committed NIEO (in the NIEO buffers) matches the corresponding default direction. If a committed NIEO contradicts the default direction between instruction blocks, it is recorded in the NEL log.
-(For relaxed memory consistency models only) When a store instruction of $P_1$ misses in L1 dcache, LPU compares its address with the addresses of recent load operations recorded in the CAMs of other cores, to identify whether there is some load operation possibly violating sequential consistency. If a load operation violates sequential consistency, its processor id, memory instruction counter, and result are recorded in the MCL log.

9 If the NIEO buffer is about to overflow, LPU will save all the buffered committed NIEOs into the NEL directly and immediately, regardless of the correctness of direction prediction. However, in practice, we seldom meet such extreme cases (in > 99% of the time, the buffer contains only ≤ 2 committed NIEOs).

10 If a core is inspected by LPU, LPU collects the memory instruction information of the core into the address signature, CAM, and RAM for the core, identifies NIEOs and violations of sequential consistency about the core, and records logs for the core. If a core is ignored by LPU, LPU does not collect any memory instruction information about the ignored core.
When the inspected core $P_1$ returns from user mode to system mode, the current instruction block (say, B) on $P_1$ prematurely ends, and LPU stops collecting the memory instruction information into the address signature, CAM, and RAM for $P_1$. Then, LPU waits a short period for the other inspected cores to complete their current instruction blocks. After that, the future instructions of other inspected cores have firm physical time orders with respect to block B on $P_1$, thus LPU no longer needs to check NIEOs involving the past user mode instructions on $P_1$. From then on, the address signature, CAM, and RAM for $P_1$ are useless, thus LPU simply clears them. In this way, LPU can easily deal with a context switch at negligible cost.
EVALUATION
In this section, the log size and design cost of LReplay are empirically evaluated. The evaluation of LReplay is carried out on the RTL design of the 8-core Godson-3. The experiment platform is an Xtreme-III simulation accelerator [Cadence 2013], which consists of 240 FPGAs to achieve a simulation speed of up to 400 kHz. The benchmarks used to evaluate LReplay are the core applications of SPLASH-2 [Woo et al. 1995].
Log Size of LReplay
As mentioned, LReplay builds on the premise that most execution orders can be inferred from the recorded pending period information. Obviously, the proportion of noninferrable execution orders among all execution orders is critical to the log size of LReplay. As shown in Figure 7, on the SPLASH-2 benchmarks, 0.99%, 1.30%, and 1.47% of execution orders are found to be noninferrable for the 256-cycle, 512-cycle, and 1024-cycle sample periods, respectively. In fact, the proportions of NIEOs for most benchmarks are less than 1%. The only exception is the benchmark ocean, for which the proportion of NIEOs is 7.99% with a 1024-cycle sample period. A potential explanation for these experimental results is that a processor seldom uses the data modified by other processors immediately (except for locks, semaphores, or barriers).
Since we use a Bloom filter signature instead of an accurate CAM to save the addresses of the cache lines stored by the recent 2048 memory instructions, it is possible that a Bloom filter incorrectly identifies an execution order as an NIEO (false positive). Figure 8 presents the proportions of false positives in NIEOs under a 1024-cycle sample period. Among all the NIEOs reported by the Bloom filter (1.47% of execution orders), only a few are false positives (0.27% of execution orders on average). In fact, the false positive rates over most benchmarks are lower than 0.2%. The low false positive rate can be attributed to the low effective insertion rate of the Bloom filter. In most cases, fewer than 100 cache lines are stored by the most recent 2048 memory instructions. Such a small number of addresses can be effectively handled by the 2240-bit signature. The low false positive rate suggests that the Bloom filter is a cost-effective solution to save the store addresses for the identification of NIEOs.
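To see why a lightly loaded signature rarely misfires, the textbook Bloom filter approximation can be applied to one 448-bit group and then combined over the five groups. This is a back-of-the-envelope sketch, not the paper's measurement: it assumes independent, uniform hash functions, and the number of hash functions `k_hashes` is an assumption since the text does not state it. The result is a per-lookup probability for an address that was never stored, which is a different quantity from the fraction of execution orders reported in Figure 8.

```python
def group_false_positive(n_lines, m_bits=448, k_hashes=2):
    """Approximate false positive probability of one group holding n_lines."""
    # Probability that a given bit is set after n_lines insertions.
    fill = 1.0 - (1.0 - 1.0 / m_bits) ** (k_hashes * n_lines)
    # A lookup is a false positive when all k probed bits happen to be set.
    return fill ** k_hashes


def signature_false_positive(lines_per_group, groups=5, m_bits=448, k_hashes=2):
    """Probability that at least one of the groups reports a false hit."""
    p = group_false_positive(lines_per_group, m_bits, k_hashes)
    return 1.0 - (1.0 - p) ** groups
```

Spreading fewer than 100 cache lines over the groups keeps each group sparsely filled, so the estimate stays small and grows quickly only when many more distinct lines are stored.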
Although the number of identified NIEOs is slightly increased when we use a Bloom filter to replace the large CAM in the LPU, the direction prediction technique presented in Section 3.2 still reduces the number of recorded NIEOs. As illustrated in Figure 9, around 50.6% of NIEOs can be correctly predicted by the default directions, thus we do not need to record them. More specifically, for the benchmark barnes, more than 57% of NIEOs can be correctly predicted. Even for the benchmark water, which already has few NIEOs, the direction prediction technique can still prune 44% of NIEOs.
Benefiting from the global-clock-based pending period information and the direction prediction technique, LReplay can achieve a satisfactory log size at the 1024-cycle sample period. As shown in Figure 10, the average size of the pending period log PPL (refer to Section 3.1.2) of LReplay is only 0.05B/K-Inst. Under this circumstance, the size of the NIEO log NEL plays an important role in the overall log size. Without the direction prediction technique, the average log size of NEL reaches 0.24B/K-Inst. In other words, the overall log size of LReplay without direction prediction reaches 0.29B/K-Inst on average (left side of Figure 10). Though such an average log size is already satisfactory, for some specific benchmarks incorporating many noninferrable execution orders (e.g., benchmark ocean), the overall log size without direction prediction can still be larger than 1B/K-Inst. In contrast, with the direction prediction technique, the average log size of NEL can be significantly reduced to 0.12B/K-Inst. Consequently, the overall log size of LReplay can be reduced to 0.17B/K-Inst over SPLASH-2 benchmarks (right side of Figure 10). Even in the worst case (benchmark ocean), the overall log size is only 0.48B/K-Inst.

Figure 11 presents the log size of LReplay for Godson-3 consistency, which is a relaxed memory consistency model allowing a load instruction to precede prior memory instructions in program order. To support Godson-3 consistency, LReplay needs to record MCL in addition to PPL and NEL. The average log size of MCL over SPLASH-2 benchmarks is 0.40B/K-Inst. As a consequence, the overall log size of LReplay for Godson-3 consistency is about 0.57B/K-Inst on average. Thus, the log of LReplay is small enough to be transferred by JTAG, whose bandwidth is less than 5MB per second.
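The totals quoted above are plain sums of the per-log averages, and the bandwidth claim can be checked the same way. In the sketch below the per-log figures come from the text, while `log_bandwidth` and its `inst_per_second` argument are hypothetical helpers; the aggregate instruction rate of the chip is an input the reader must supply, and K-Inst is assumed to mean 1000 instructions.

```python
# Average log sizes in bytes per thousand instructions (from the text).
PPL = 0.05
NEL_WITHOUT_PREDICTION = 0.24
NEL_WITH_PREDICTION = 0.12
MCL = 0.40

# Sequential consistency, without and with direction prediction.
overall_sc_without_prediction = PPL + NEL_WITHOUT_PREDICTION
overall_sc = PPL + NEL_WITH_PREDICTION
# Godson-3 consistency adds the MCL log.
overall_godson3 = overall_sc + MCL


def log_bandwidth(bytes_per_kinst, inst_per_second):
    """Log bytes produced per second at a given aggregate instruction rate."""
    return bytes_per_kinst * inst_per_second / 1000.0
```

Under the assumption of one billion instructions per second in total, `log_bandwidth(overall_godson3, 1e9)` comes to well under one megabyte per second, which is consistent with the claim that a JTAG port suffices.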
Moreover, we evaluate the log size of LReplay under an extreme case. We randomly generate an 8-thread test program, which contains 50% load instructions and 25% store instructions. All 8 threads pseudorandomly access a uniform small memory region (512KB) intensively. The small memory region forces each thread to contend with others frequently. Hence, the test program is more memory contention intensive than normal parallel applications. Furthermore, the communications of the test program are much more flexible than the regular lock-access-unlock pattern of most parallel applications (e.g., SPLASH-2 benchmarks). Even for such an extreme test program, the overall log size of LReplay under Godson-3 consistency is still only 1.01B/K-Inst (0.07B/K-Inst PPL, 0.43B/K-Inst NEL, and 0.51B/K-Inst MCL), which demonstrates the robustness of LReplay.

Fig. 11. Log size of LReplay for Godson-3 consistency at 1024-cycle sample period. PPL is the log for pending period, NEL is the log for noninferrable execution order, and MCL is the additional log for relaxed memory consistency models.
Design Cost of LReplay
The design cost of LReplay is introduced from the following aspects.
Logical Design. To implement LReplay on Godson-3, we trivially modify the processor core to connect about 200 internal signals per core (for memory instruction information) to the LPU. No modification to other existing components of Godson-3 is needed.
Physical Design. LReplay mainly requires a dedicated Network-On-Chip (NOC) with about 2000 wires and an LPU. Both placing and routing an NOC with 2000 wires are trivial for the physical design of a commercial processor (considering that the crossbar of Godson-3 has hundreds of thousands of wires). For 8-core Godson-3, the area consumption of the LPU is only about 1.5 square millimeters under STMicro 65nm GP/LP mixed process according to the report of Synopsys's Design Compiler. Noting that the overall area of the 8-core Godson-3 is about 300 square millimeters, it can be concluded that the area consumption of LReplay is only about 0.5% of the chip 11 .
Verification. The existing components of Godson-3 are almost unmodified, thus their functionalities are not affected by LReplay. Consequently, no reverification is required for Godson-3.
Performance. There is no performance loss in production run since LReplay does not have any impact on the normal functionality of Godson-3.
In summary, the low design cost of LReplay shows the potential of integrating hardware-assisted deterministic replay into future industrial processors.
RELATED WORK
Hardware-Assisted Deterministic Replay
The history of hardware-assisted deterministic replay can be traced back to the early 1990s [Bacon and Goldstein 1991]. Bacon and Goldstein's technique can record the cache coherence messages on the snooping bus of a multiprocessor system. However, the log size of their work is quite large. Later, some hardware-assisted deterministic replay schemes adopted the transitivity reduction technique proposed by Netzer [1993] to prune unnecessary logical time orders between instructions. Flight Data Recorder (FDR) [Xu et al. 2003] is one such scheme. BugNet further improves FDR by adopting a hardware-based dictionary to record the outputs of load instructions [Narayanasamy et al. 2005]. In 2006, an enhanced version of FDR, named RTR, was proposed [Xu et al. 2006]. The main improvement of RTR is that RTR records the logical time orders between memory instructions while FDR records merely the execution orders, which enables a more aggressive transitivity reduction technique.
There are also some hardware-assisted deterministic replay schemes that record the logical time orders between instruction blocks rather than individual instructions. Strata [Narayanasamy et al. 2006] partitions the whole execution into many strata. Each stratum is a vector of counters, recording the number of memory instructions issued by each processor between the last and the current strata. Before any processor issues a second inter-process communication, a stratum is stored. Rerun [Hower and Hill 2008] records the length of each episode (an instruction block without inter-process communication) and the execution orders between episodes. Concurrent with the conference version of LReplay, Voskuilen et al. proposed Timetraveler [Voskuilen et al. 2010], which exploits an acyclic race to enlarge the length of instruction blocks. The log size of Timetraveler is comparable with that of LReplay. Coreracer [Pokam et al. 2011] is a practical memory race recorder for x86 CMPs that records the logical time orders between instruction blocks. Rather than modifying the cache coherence protocol to record block orders, it makes use of an existing x86 feature named invariant timestamp counter. Furthermore, Coreracer can elegantly tackle relaxed memory consistency (e.g., TSO) with the reordered store window technique.
Moreover, Lee et al. proposed to use an offline symbolic analyzer to deduce execution orders between memory instructions. Their scheme only requires relatively simple hardware support for logging cache misses, and can be enhanced to cope with relaxed memory consistency systems. However, the scheme needs a long analyzing time to deduce execution orders (maybe hundreds of times longer than the execution time of production run).
Substantially different from existing schemes which directly record the logical time order between instructions or instruction blocks, LReplay makes use of a global clock to record pending period information. Benefiting from the global clock, LReplay records position, size, and relation of instruction blocks with a tiny log size (0.05B/K-Inst). In contrast, previous block-based schemes like Pokam et al. [2011] still need to accurately record the instruction block information in the absence of the global clock. Moreover, LReplay employs the proposed direction prediction technique to further reduce the log size for the residual noninferrable execution orders. As a result, the overall log size of LReplay is only 0.17B/K-Inst over SPLASH-2 benchmarks.
Another advantage of LReplay is that it aims at fulfilling the industrial DFD guidelines. Different from many existing hardware-assisted deterministic replay schemes which entangle themselves with different components of a CMP, LReplay is decoupled from the normal functionality to avoid modifying existing components of a processor. Furthermore, to the best of our knowledge, LReplay is the first attempt to evaluate hardware-assisted deterministic replay on relaxed memory consistency models (which are the mainstream of industry).
Software-Only Deterministic Replay
Researchers have also devoted considerable effort to software-only deterministic replay schemes. Software-only schemes rely purely on the support of specific system software environments (e.g., operating system, virtual machine, and library) to record the nondeterministic factors including execution orders [Dunlap et al. 2008; Leblanc and Mellor-Crummey 1987; Patil et al. 2010]. However, these schemes may cause significant slowdown to the production run. Veeraraghavan et al. proposed to convert the ordering among memory instructions to the thread scheduling so as to reduce the slowdown of production run caused by recording [Veeraraghavan et al. 2011]. Nonetheless, their scheme may cause a 50% reduction in the overall throughput of the system.
Recently, some aggressive software-only schemes have recorded the synchronization orders and program outputs only, and have given up recording the execution orders between normal memory instructions. These schemes either need to repeatedly replay the program (or program section) until the outputs of a replay run happen to be the same as those of the production run [Lee et al. 2010; Park et al. 2009] or employ time-consuming constraint solvers to deduce all execution orders [Altekar and Stoica 2009].
Deterministic Parallelism
Different from deterministic replay, deterministic parallelism achieves determinism without logging. The main idea is to restrict the interleaving between memory instructions to comply with some predefined orderings [Bergan et al. 2010; Devietti et al. 2009; Hower et al. 2011; Montesinos et al. 2008, 2009]. However, this strategy often requires forcibly pausing threads even when they should be able to proceed. As a result, deterministic parallelism brings remarkable performance costs to the production run. For example, CoreDet [Bergan et al. 2010] may slow down the production run by 1.1-6 times.
CONCLUSION
Hardware-assisted deterministic replay is a promising solution for debugging parallel programs, and has attracted broad interest over the past decade. To bridge the gap between academia and industry, LReplay attempts to combine emerging techniques developed by academia (i.e., the global clock and pending period of a CMP) with the industrial requirements of DFD. The low cost and high efficiency of LReplay on the RTL design of a commercial processor may help alleviate the hesitation of industry about the risk of adding hardware support for deterministic replay.
