Abstract--Advances in ILP techniques enable strict consistency models to relax memory order through speculative execution of memory operations. However, ordering constraints still hinder performance because speculatively executed operations cannot be committed out of program order, owing to the possibility of mis-speculation. In this paper, we propose a new technique that allows memory operations to be committed non-speculatively out of order without violating consistency constraints.
I. INTRODUCTION
The memory consistency model is a crucial factor in the performance of shared-memory multiprocessor systems. Strict consistency models (such as sequential consistency (SC) or total store ordering (TSO)) offer an intuitive programming interface, but the inability to perform memory operations out of program order limits performance.
Modern microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). ILP techniques enable aggressive optimizations for strict consistency models, which relax memory order speculatively. Gharachorloo et al. [2] proposed hardware prefetch and speculative load execution. These two optimizations significantly improve the performance of strict consistency models by issuing memory operations out of program order. However, ordering constraints in strict consistency models still hinder performance [6]. First, the store-to-load ordering (in SC) prohibits a load from bypassing prior stores and retiring from the reorder buffer, so a high-latency store may block the instruction flow through the reorder buffer. Second, the store-to-store ordering (in SC or TSO) forces stores to be performed one after another, which may leave memory units and cache ports underutilized.
To alleviate the above problems, speculative retirement [6] and SC++ [3] have been proposed. These techniques allow memory operations to be speculatively committed¹ out of program order. Thus, memory operations no longer stall processor pipelines while waiting for the completion of prior operations. However, out-of-order commitment may incur a consistency violation. To recover from possible consistency violations due to speculative commitment, a processor must store the architectural state in an additional history buffer and roll back when a mis-speculation is detected.
¹ We say that an operation is committed when it updates the processor and memory state: a load is committed when it updates its destination register and is retired from the reorder buffer, and a store is committed when it updates the cache and is removed from the store buffer (or memory queue).
The effectiveness of the speculative commitment techniques is mainly limited by the size of the history buffer, that is, by the number of instructions that can be committed speculatively at a time. The size of the history buffer must scale with the performance gap between the processor and the memory subsystem: to hide a higher memory latency, a processor must provide a larger history buffer [3]. In particular, when a load is committed speculatively, the history buffer must keep rollback information not only for the load itself but also for all instructions following the load. Another drawback of speculative mechanisms is that the previous architectural state must be read and updated atomically. This read-modify-write operation may increase contention for resources. Lastly, frequent rollbacks due to early prefetches or false sharing degrade performance [6].
In this paper, we propose a new mechanism, the request reorder buffer (RRB) technique, to alleviate the impact of the store-to-load and store-to-store ordering constraints in strict consistency models. The RRB technique enables non-speculative commitment of memory operations (both loads and stores) that bypass prior stores. To guarantee consistency constraints, the RRB technique rearranges the global order of memory operations by delaying a cache coherence request. Because the RRB technique does not require rollback, long memory access latencies can be hidden at a small storage cost, and the negative effects of the speculative techniques are removed.
II. THE RRB ARCHITECTURE
This section describes the RRB technique. We assume a cc-NUMA system similar to the SGI Origin2000 [4]. Processors incorporate hardware prefetch, speculative load execution, and store buffering, and an invalidation-based cache coherence protocol is used. We describe how the RRB technique works under SC; it can be applied directly to TSO.
A. Delaying Coherence Request with the RRB Technique
In the sample program in Fig. 1, suppose that P0 executes operation b before a and sets r1 to 0. In this situation, the out-of-order execution of b may or may not violate sequential consistency: if the result of b is the same as the result produced by an in-order execution of b, sequential consistency is not violated; otherwise, the out-of-order execution may cause an inconsistency. For example, suppose that the operations are executed in the order (b, a, c, d). In this case, an in-order execution of b after a, (a, b, c, d), would produce the same result, r1=0, as the out-of-order execution. Thus, the execution (b, a, c, d) is sequentially consistent. However, out-of-order execution sometimes leads to an inconsistency. If c and d are performed between b and a, as in (b, c, d, a), the result of b, r1=0, differs from the result of the in-order execution (c, d, a, b), which produces r1=1. In this case, the out-of-order execution violates sequential consistency: the result r1=0 and r2=0 can never be obtained under sequential consistency. This inconsistency arises because c and b access the same memory location and at least one of them is a store; that is, c is a conflict operation with b. Thus, the two execution orders (b, c) and (c, b) produce different results. Because a conflict operation was executed between b and a, the out-of-order execution of b produces a different result from the in-order execution, which may lead to an inconsistency. To avoid this inconsistency, in previous schemes [2] [3] [6], P0 nullifies the result of the out-of-order execution of b when it receives the invalidation request of c, and then re-issues b.
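The argument above can be checked by brute force. The following Python sketch enumerates every interleaving that respects program order and collects the possible (r1, r2) results. Fig. 1 is not reproduced in this text, so the program used here (P0: a stores to X, b loads Y into r1; P1: c stores to Y, d loads X into r2) is an assumed reconstruction based on the description above, and the names `PROGRAM`, `run`, and `respects_program_order` are illustrative only.

```python
from itertools import permutations

# Assumed reconstruction of the Fig. 1 program (hypothetical, inferred from the text):
#   P0:  a: X = 1      b: r1 = Y
#   P1:  c: Y = 1      d: r2 = X
PROGRAM = {
    "a": ("store", "X", None),
    "b": ("load", "Y", "r1"),
    "c": ("store", "Y", None),
    "d": ("load", "X", "r2"),
}


def run(order):
    """Perform the operations in the given global order and return (r1, r2)."""
    memory = {"X": 0, "Y": 0}
    registers = {}
    for name in order:
        kind, location, register = PROGRAM[name]
        if kind == "store":
            memory[location] = 1
        else:
            registers[register] = memory[location]
    return registers["r1"], registers["r2"]


def respects_program_order(order):
    """True if a precedes b (on P0) and c precedes d (on P1)."""
    return order.index("a") < order.index("b") and order.index("c") < order.index("d")


# Every result that some sequentially consistent execution can produce.
sc_results = {run(o) for o in permutations("abcd") if respects_program_order(o)}

print(sorted(sc_results))                      # (0, 0) does not appear
print(run(("b", "a", "c", "d")) in sc_results)  # out-of-order but still consistent
print(run(("b", "c", "d", "a")) in sc_results)  # conflict operation c falls between b and a
```

Under these assumptions, the order (b, a, c, d) yields a result that an in-order execution can also produce, while (b, c, d, a) yields r1=0 and r2=0, which no program-order interleaving produces.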
On the other hand, if the invalidation request of c is delayed until a is complete, we can avoid re-issuing b. By delaying the invalidation request, we can guarantee that the result of b is the same as that of the in-order execution of b, because c will never be executed between b and a; that is, the execution order (b, a, c, d) is enforced by delaying the invalidation request of c. Delaying a coherence request does not affect the correctness of a program because operations from different processors can be performed in any order. Delaying coherence requests was proposed by Adve and Hill [1] to perform synchronization operations out of order without violating ordering constraints.
As the above example shows, the out-of-order execution of an operation does not violate consistency constraints as long as the coherence request to the accessed block is delayed until the operation could have been issued in order. In this paper, we propose the RRB technique, which allows an operation to be committed out of order. Unlike previous schemes [3] [6], which rely on speculative commitment and rollback, the proposed scheme commits an operation non-speculatively while consistency constraints are guaranteed by delaying coherence requests. Note that an operation is committed out of order only if it is a load that is about to be retired or a store that has already been retired but is still buffered in the store buffer (or memory queue). Thus, the RRB technique does not cause imprecise exceptions.
The request reorder buffer (RRB), a special buffer between the processor and the cache controller, is responsible for delaying coherence requests. When a processor commits an operation out of order, the address of the accessed cache block is stored in the RRB. Whenever a coherence request is received, the cache controller must check the RRB before it processes the request. If the target address of the coherence request matches an RRB entry, the request is delayed until the operations prior to the committed operation (identified by a field in the RRB entry) are complete. Fig. 2 shows an example of out-of-order commitment using the RRB technique.
Fig. 2. Example of out-of-order commitment. If store C is committed out of order, the address C is registered in the RRB. When a coherence request to C is received, the request is stored in the RRB until prior stores are complete.
B. Deadlock Avoidance
Because an operation may be delayed indefinitely, the RRB technique may cause a deadlock. In the sample program in Fig. 1, suppose that P0 commits b bypassing a, and the invalidation request of c is delayed at P0. There is then a wait-for dependency between a and c, denoted a→c: c waits for the completion of a. In this situation, if P1 also commits d out of order, the invalidation request of a must be delayed until c is complete. Thus, there is also a wait-for dependency c→a. This cyclic dependency leads to a deadlock. In this paper, we propose a deadlock avoidance scheme that limits out-of-order commitment based on the address of the memory block. From now on, we denote the address of the memory block accessed by an operation x by '&x'.
Deadlock avoidance scheme: A processor is allowed to commit an operation x bypassing an operation y if and only if &x > &y.
Correctness: We argue by contradiction. Suppose that the RRB technique with the proposed deadlock avoidance scheme creates a cyclic wait-for dependency among several processors:
$$a_0 \to a_1 \to \cdots \to a_n \to a_0 \qquad (a_k \text{: memory operations}) \tag{1}$$
A wait-for dependency $a_k \to a_{k+1}$ implies that $a_{k+1}$ accesses the same memory location as an operation $x$ that is committed bypassing $a_k$, so $\&x = \&a_{k+1}$. Under the proposed scheme, $x$ can bypass $a_k$ if and only if $\&a_k < \&x$; thus $\&a_k < \&a_{k+1}$.
Applied to every edge, the supposed dependency (1) means
$$\&a_0 < \&a_1 < \cdots < \&a_n < \&a_0,$$
which is a contradiction. Therefore, out-of-order commitment cannot create a cycle. ■
As an example, in Fig. 1, if &b > &a, P0 is allowed to commit b bypassing a, but P1 cannot commit d out of program order, because d accesses the block of a and c accesses the block of b, so &d < &c. Thus, no cyclic wait-for dependency is created and deadlock is avoided.
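To make the role of the address condition concrete, the following Python sketch builds the wait-for edges for a two-processor program and checks the resulting graph for a cycle, once with unrestricted bypassing and once with the rule &x > &y. It is a simplified model under stated assumptions: the program layout, the block addresses, and the helper names (`may_bypass`, `wait_for_edges`, `has_cycle`) are hypothetical, and the model ignores operation types, treating any access by another processor to the same block as a delayed request.

```python
def may_bypass(x_addr, y_addr):
    """Deadlock avoidance condition: x may be committed bypassing y iff &x > &y."""
    return x_addr > y_addr


def wait_for_edges(processors, allowed):
    """Return edges (y, z): z's coherence request waits for the completion of y.

    processors: one list per processor of (name, block_address) in program order.
    allowed:    predicate deciding whether a later op may bypass an earlier one.
    """
    edges = set()
    for pid, ops in enumerate(processors):
        for i, (y, y_addr) in enumerate(ops):
            for _x, x_addr in ops[i + 1:]:
                if not allowed(x_addr, y_addr):
                    continue
                # x committed bypassing y: requests to &x from other processors wait for y.
                for qid, other_ops in enumerate(processors):
                    if qid == pid:
                        continue
                    for z, z_addr in other_ops:
                        if z_addr == x_addr:
                            edges.add((y, z))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the wait-for graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state = {}

    def visit(node):
        if state.get(node) == "visiting":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(nxt) for nxt in graph.get(node, [])):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in list(graph))


X, Y = 0x100, 0x200  # assumed block addresses with &a = &d = X and &b = &c = Y
PROCESSORS = [[("a", X), ("b", Y)], [("c", Y), ("d", X)]]

print(has_cycle(wait_for_edges(PROCESSORS, lambda x, y: True)))  # unrestricted bypassing
print(has_cycle(wait_for_edges(PROCESSORS, may_bypass)))         # with the address rule
```

In this model, unrestricted bypassing produces the edges a→c and c→a, while the address rule admits only a→c, mirroring the argument in the proof.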
Although the proposed scheme avoids deadlock, it limits performance because some operations cannot be committed out of order under the deadlock avoidance condition. In general, the more operations are committed out of order, the higher the performance. With the proposed deadlock avoidance scheme, performance depends heavily on the memory access patterns of an application. The proposed scheme performs well on applications in which the addresses of long-latency stores are usually lower than the addresses of the operations that follow them.
We expect that a more sophisticated deadlock avoidance scheme could achieve a larger performance improvement by, for example, exploiting the memory access patterns of applications or having compilers relocate the addresses of memory blocks.
C. Implementation of the RRB Technique
Fig. 3 shows the block diagram of the processor architecture for the RRB technique. For simplicity, we present only the logic blocks related to the RRB technique. The memory queue is similar to the address queue of the MIPS R10000 processor and holds memory operations in program order. The retire logic decides whether an operation can be retired. The issue logic performs the operations in the memory queue. Without the RRB technique, a completed load cannot be retired bypassing stores, and a store that is retired and buffered in the memory queue cannot be performed bypassing stores. With the RRB technique, loads and stores are committed bypassing prior stores. To delay the coherence request, an RRB entry is created for the committed operation.
An RRB entry consists of four fields. The addr field stores the block address of a committed operation. The prior_st field points to the nearest store prior to the committed operation in the memory queue. The wait_req field stores the delayed coherence requests. At most two coherence requests can be stored in the wait_req field: a coherence request from the directory and a replacement request. Note that once a coherence request is delayed by the RRB, the directory does not send another coherence request to that block until the delayed request has been answered. The last field is the op_type field, which indicates the type (load or store) of the committed operation. The op_type field is required to prevent useless delays: if a load is committed out of order and the accessed block happens to be in the cache in the exclusive state, an invalidation request to the block must be delayed, but a downgrade request can be serviced because loads do not conflict with each other. A downgrade request is delayed only if there is an RRB entry whose op_type is 'store'.
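The four fields and the op_type filter can be sketched as a small data structure. This is a minimal Python model, not the hardware design: the field names follow the text, but the request kinds `"invalidate"` and `"downgrade"`, the function name `must_delay`, and the use of an integer identifier for prior_st are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RRBEntry:
    addr: int        # block address of the operation committed out of order
    prior_st: int    # assumed: identifier of the nearest prior store in the memory queue
    op_type: str     # "load" or "store"
    wait_req: list = field(default_factory=list)  # delayed requests (at most two in the text)


def must_delay(entries, req_addr, req_kind):
    """Decide whether an incoming coherence request has to be delayed.

    An invalidation is delayed by any entry for the same block; a downgrade is
    delayed only if a matching entry was created by a store.
    """
    for entry in entries:
        if entry.addr != req_addr:
            continue
        if req_kind == "invalidate":
            return True
        if req_kind == "downgrade" and entry.op_type == "store":
            return True
    return False


entries = [
    RRBEntry(addr=0x40, prior_st=3, op_type="load"),
    RRBEntry(addr=0x80, prior_st=3, op_type="store"),
]
print(must_delay(entries, 0x40, "invalidate"))  # load entry still blocks an invalidation
print(must_delay(entries, 0x40, "downgrade"))   # load entry lets a downgrade through
print(must_delay(entries, 0x80, "downgrade"))   # store entry blocks a downgrade
```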
To implement the deadlock avoidance scheme, we add the reorder logic and a one-bit gt_prev field to the memory queue; the gt_prev field indicates whether the operation can be committed out of order with respect to the deadlock avoidance condition. The reorder logic sets the gt_prev field of each operation in the memory queue by comparing the target address of the operation with those of prior operations. If an operation accesses an address higher than those of all prior operations, its gt_prev field is set, indicating that the operation can be committed.
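The gt_prev computation amounts to a running maximum over the memory queue. The Python sketch below assumes the queue is given as a list of block addresses in program order; the function name `compute_gt_prev` is hypothetical, and setting the bit for the operation at the head of the queue (which has no prior operations) is an assumption, since the text does not state that case.

```python
def compute_gt_prev(block_addrs):
    """Return one gt_prev bit per memory-queue entry, in program order.

    The bit is set when the operation's block address is strictly higher than
    the addresses of all prior operations in the queue.
    """
    flags = []
    highest_prior = None
    for addr in block_addrs:
        # Assumption: the head of the queue has no prior operations, so its bit is set.
        flags.append(highest_prior is None or addr > highest_prior)
        highest_prior = addr if highest_prior is None else max(highest_prior, addr)
    return flags


print(compute_gt_prev([0x300, 0x100, 0x400, 0x400]))
```

Tracking only the highest prior address keeps the comparison to one value per entry in this model; the actual reorder logic may be organized differently.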
Out-of-order commitment of loads and stores is carried out by the retire logic (for loads) and the issue logic (for stores). When a load reaches the top of the reorder buffer, the retire logic checks the memory queue status. If the gt_prev field of the load is set, the load is committed even if there are prior incomplete stores. Stores are committed by the issue logic: if there is a store that is retired and whose gt_prev field is set, the issue logic performs the store to the cache. Whenever an operation is committed out of order, the operation is registered in the RRB and removed from the memory queue.
Whenever a coherence request is received, the cache controller checks the RRB before it processes the request. If the target address and the op_type of the coherence request match an RRB entry, the request is removed from the cache controller and stored in the wait_req field of that entry.
An RRB entry is removed from the RRB when the operation pointed to by its prior_st field and all operations prior to it are complete. Whenever an operation completes and is removed from the head of the memory queue, the issue logic searches the prior_st fields of the RRB. If the operation matches, the corresponding RRB entries are freed and any delayed coherence requests are revived.
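The register, delay, and release steps described above can be put together in a small behavioral model. The Python sketch below is an illustration under assumptions, not the hardware implementation: the class and method names are hypothetical, operations are identified by integers, the op_type filter is omitted for brevity, and the default capacity of 64 is taken from the configuration used in the evaluation.

```python
class RequestReorderBuffer:
    """Behavioral sketch of the RRB: register, delay, and release."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = []  # each entry: {"addr", "prior_st", "wait_req"}

    def register(self, addr, prior_st):
        """Record an operation committed out of order; False if the RRB is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"addr": addr, "prior_st": prior_st, "wait_req": []})
        return True

    def on_coherence_request(self, addr, request):
        """Return True if the request was delayed, False if it can be processed now."""
        for entry in self.entries:
            if entry["addr"] == addr:
                entry["wait_req"].append(request)
                return True
        return False

    def on_head_complete(self, op_id):
        """Free entries whose prior_st matches the completed operation.

        Returns the delayed coherence requests that are revived.
        """
        revived, kept = [], []
        for entry in self.entries:
            if entry["prior_st"] == op_id:
                revived.extend(entry["wait_req"])
            else:
                kept.append(entry)
        self.entries = kept
        return revived


rrb = RequestReorderBuffer()
rrb.register(addr=0xC0, prior_st=7)              # an operation to block 0xC0 bypassed store 7
print(rrb.on_coherence_request(0xC0, "inv-1"))   # delayed
print(rrb.on_coherence_request(0x100, "inv-2"))  # no matching entry
print(rrb.on_head_complete(7))                   # store 7 completes, request revived
```

Because the memory queue holds operations in program order and completed operations leave from its head, matching on the prior_st identifier alone stands in for "all prior operations are complete" in this model.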
Implementing the RRB technique in cc-NUMA systems complicates the design of the memory subsystem, because resource contention in the memory subsystem may lead to deadlock. For example, suppose that an operation y is delayed by the RRB mechanism and waits for the completion of x. If y occupies a resource that is required for the completion of x, deadlock could occur due to resource contention (x also waits for y to release the resource).
Because the detailed implementation of the memory subsystem differs significantly among multiprocessor systems, we describe some general techniques to avoid deadlock due to resource contention. One simple method is to release all resources occupied by a delayed memory operation. This can be implemented by retrying the delayed operation rather than blocking it at a cache. Because the delayed operation does not hold any resources but repeatedly releases and re-acquires them, a fair arbitration mechanism for resources guarantees deadlock freedom. Another method is to guarantee the completion of the oldest memory operation of each processor by limiting the number of outstanding operations from one processor. In this method, each processor is allowed to issue at most n operations at a time, and the memory system supplies enough resources for all outstanding operations to be performed without contention. If each processor reserves an issue slot for its oldest operation, the oldest operation in a processor can always be issued and is never blocked for resources. Thus, deadlock is avoided.
III. PERFORMANCE EVALUATION
We used RSIM [5] to simulate a cc-NUMA system with 16 processors. Table I shows the base system configuration. The benchmark applications are from the SPLASH2 suite, except for Mp3d, which is from SPLASH. Table II gives the input sizes used for the benchmark applications. We assume a relatively large L2 cache to eliminate capacity and conflict misses, so that the performance difference among the memory models is solely due to the intrinsic behavior of the models. In our experiments, all implementations use non-blocking caches, hardware prefetch, speculative load execution, and store buffering. We set the number of RRB entries to 64. We simulated five implementations: SC, TSO, SC+RRB, TSO+RRB, and RC. Note that the performance of TSO is the upper limit of relaxing the store-to-load constraint in SC, and RC is the upper limit of relaxing all of the ordering constraints.
Fig. 4 shows the execution time of the benchmarks normalized to SC. In Radix and Raytrace, the RRB technique reduced the execution time under SC by 17.9% and 16.8%, respectively. On average, the RRB technique achieves a performance improvement of 10.5% in SC and 6.3% in TSO. The performance gap between SC+RRB and TSO is within 3.8%. In Barnes and Raytrace, SC+RRB outperforms TSO because these applications are more sensitive to store-to-store ordering. Compared with RC, the gap between SC+RRB and RC is within 6.4%, and TSO+RRB even outperforms RC. This is because the RRB technique effectively handles contention on a lock variable by committing the unlock operation out of order and delaying early fetches to the lock variable. These early fetches are caused by spinning on the lock variable; if they are not delayed, they usually degrade the performance of lock hand-off.
An RRB with 64 entries was sufficient. In most applications, the proposed deadlock avoidance scheme was too restrictive and sensitive to the memory access patterns of the applications. Fig. 5 shows the normalized execution time when we increase the network latency to 5 times the remote latency given in Table I. As the network latency increases, the performance gain from the RRB technique increases. Under TSO, the RRB technique reduced the execution time by 12.1% on average.
IV. CONCLUSION
We have presented the RRB technique to enhance the performance of strict consistency models. With the RRB technique, memory operations can be committed out of program order without violating consistency constraints. The current proposal limits the performance gain in order to avoid deadlock. We expect that the deadlock avoidance condition could be relaxed further by exploiting the memory access patterns of applications or by having compilers relocate the addresses of memory operations.
