Abstract
Introduction
Clustered microarchitectures enable a scalable core design because intracluster operation remains fast irrespective of the total execution width of the microprocessor core. On the other hand, intercluster communication generally incurs a latency equivalent to that of a centralized microarchitecture. Therefore, the key to fully utilizing the potential of clustered microarchitectures is to assign instructions to clusters such that communication between the clusters is reduced.
The assignment of instructions to clusters (hereafter called "steering") must be completed before the pipeline proceeds into the clusters. In the remainder of this paper, we focus on models that dynamically steer ordinary RISC instructions at the front-end. Steering schemes that depend heavily on special compilers are undesirable from the viewpoint of usability, because programs must be recompiled for cores of different sizes. Problems encountered by the dynamic steering model, such as the complexity of the renaming and dispatching logic, can be solved by cooperating with trace caches and incorporating techniques that steer instructions at retire time.
For efficient steering schemes, previous works focused on data dependencies and workload balancing. In particular, these works identify the cluster that produces the instruction's operands (OP-cluster) and the cluster that has the least workload (LW-cluster), and then select the most suitable cluster from among them. Many previous works on clustered microarchitectures suggest that assigning instructions by dataflow and workload can result in better performance than assigning them by functional units.
In these works, memory dependencies have often been idealized as register dependencies that can be perfectly analyzed at decode time. However, store, load, and consumer instructions that use the result of a load instruction as an operand have to be examined carefully because they often become the critical paths in code execution. Actually, memory dependence information at the steering stage is currently highly ambiguous because the memory address is not calculated at that time.
In this work, we focus on combining a steering scheme with memory dependence predictors. First, we propose a technique that assigns a STORE instruction and the LOAD instruction that depends on it to the same cluster. Hereafter, store, load, and consumer instructions written in capital letters (STORE, LOAD, and CONSUMER) denote instructions that have dependencies between them. When this scheme is introduced into the steering logic, the communication delay between clusters is removed from the tag forwarding of the memory-dependent instruction, and an earlier wakeup of the memory instruction can be expected. In this work, we also propose a mechanism that enables high-speed intracluster memory reference when memory-dependent instructions are assigned to the same cluster.
The rest of this paper is organized as follows. Section 2 details the baseline clustered microarchitecture and presents our baseline steering scheme. Section 3 evaluates the impact of memory-aware steering schemes on clustered microarchitectures under various conditions. Section 4 proposes a technique called "Distributed Speculative Memory Forwarding" (DSMF) that uses the STORE value as a CONSUMER operand; the forwarding is performed within a cluster irrespective of the cache and communication delay. The front-end of this technique requires additional preparation; nevertheless, this does not add overhead to the critical path in the backend. Section 5 presents the evaluation of DSMF. Section 6 outlines related work, and Section 7 concludes this work.
Baseline Clustered Microarchitecture

Clustered Microarchitecture
The microprocessor core has achieved improved performance based on the design policy of "operating on a faster clock and with a wider execution width." However, the bottleneck of the core design has become the wire delay of the data path to which a number of functional units are attached. Thus, the operating frequency decreases drastically with an increase in the execution width. One of the designs proposed to overcome this dilemma is the "clustered microarchitecture." Clustered microarchitectures are expected to improve the performance within the core, whereas the "multi-core" approach aims at improving the performance of the entire chip.
In the clustered microarchitecture design, a wide execution core is partitioned into smaller execution cores called "clusters." Clustered microarchitectures can improve the operating clock rate by limiting the range over which tags and data can be forwarded in one cycle to a single cluster. The advantage of clustering over simple super-pipelining is that the number of instructions per cycle (IPC) does not decrease when processing is performed in the same cluster. A typical clustered microarchitecture retains binary compatibility with conventional RISC architectures by assigning code to each cluster dynamically. The clusters can behave as a wide core by means of a synchronization mechanism among schedulers, registers, and memory.
Outline of Our Architecture
Figure 1 shows the block diagram of our baseline clustered microarchitecture. The pipeline organization and instruction set of this architecture are based on the DEC Alpha 21264 processor. While the backend of the Alpha processor has a centralized issue queue and clusterized data paths, the issue queue in our architecture is also partitioned among the clusters. For each instruction, the cluster in which the instruction is processed is determined by the steering logic added to the front-end, and the instruction is dispatched to the issue queue of that cluster.
The backend is organized into eight clusters. Each cluster has one ALU, issue queue, register file, and data path. In order to take into account the effect of memory dependence, our baseline microarchitecture assumes some naive designs and simplifies the evaluation model. In the proposed microarchitecture, each cluster has a replicated register file, and the delay in intercluster communication is constant. Moreover, each cluster is capable of executing all types of instructions.
Frontend
The "Local Distance" scheme [5] for predicting the cluster with the earliest issue is adopted in the steering scheme of the baseline. An instruction is assigned to the cluster from which the operand will be produced. However, the instruction is assigned to the cluster with the least workload when a large number of other instructions are already assigned to the operand producing cluster following the operand producing instruction.
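The baseline decision described above can be sketched as follows. This is a minimal illustration, not the Local Distance scheme of [5] itself: the function name, the `behind_producer` count, and the `threshold` value are assumptions introduced here to model "a large number of other instructions" already queued behind the operand producer.

```python
from typing import List, Optional


def steer(workloads: List[int], op_cluster: Optional[int],
          behind_producer: int = 0, threshold: int = 2) -> int:
    """Pick a cluster for one instruction (illustrative sketch only).

    workloads       -- pending instructions per cluster (hypothetical metric)
    op_cluster      -- cluster that will produce the operand, or None
    behind_producer -- instructions already steered to op_cluster after
                       the operand-producing instruction
    threshold       -- assumed cut-off for "a large number"
    """
    # LW-cluster: the cluster with the least workload (first one on ties).
    lw_cluster = min(range(len(workloads)), key=workloads.__getitem__)
    if op_cluster is None or behind_producer >= threshold:
        return lw_cluster
    # Otherwise follow the operand to avoid intercluster communication.
    return op_cluster
```

For example, with workloads `[3, 1, 2, 0]`, an instruction whose operand is produced in cluster 0 stays in cluster 0 while few instructions are queued behind the producer, and falls back to cluster 3 once the assumed threshold is reached.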
Backend
The executed result is forwarded to all the clusters because the baseline architecture assumes a naive replicated register file. A result executed in another cluster is reflected in the local replicated register file so that all clusters maintain the same register context. Although the issue queue has separate contents in each cluster, it forwards the tag of each issued instruction to all the clusters in order to resolve dependences that span clusters. Efficient models that exclude such broadcasts between clusters are studied in [1].
Experimental Parameters
The baseline microarchitecture detailed above was implemented on a simulator, with which all results in this paper were generated. The simulator is cycle accurate and execution driven, and it reflects the effect of branch misprediction. Our design parameters for evaluation are shown in Table 1. The evaluation used the SPEC95 suite as input (the first 16M instructions of each benchmark). All the benchmarks were compiled to Alpha binaries using gcc v2.95 with the -O2 option.
Memory Dependence Steering
Ordering of the Memory Access
To preserve the dependence between memory instructions, the scheduler has to control the issue timing of memory instructions. In the baseline architecture, the following rules, similar to those of the Alpha 21264, were adopted.
• A WAW dependence is assumed between all store instructions, so store instructions are executed in FIFO order.
• The parent store instruction is predicted in the front-end stage of the load instruction, and the load instruction waits for the predicted store instruction.
The prediction table [6] adopted in the Alpha 21264 records the occurrence of past memory RAW violations and predicts the presence of memory dependence. If the load instruction is predicted to be a "dependent load," then dependence between the load instruction and all the preceding store instructions is assumed.
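The violation-history prediction described above can be sketched as a table of one-bit entries indexed by the load PC. This is a minimal illustration under stated assumptions: the class name, the table size, and the PC hashing are hypothetical and are not taken from the Alpha 21264 documentation.

```python
class WaitTable:
    """One bit per entry: set when a load at this PC caused a RAW violation."""

    def __init__(self, entries: int = 1024):  # table size is an assumption
        self.bits = [False] * entries

    def _index(self, load_pc: int) -> int:
        # Assumed hashing: drop the byte offset of 4-byte instructions.
        return (load_pc >> 2) % len(self.bits)

    def record_violation(self, load_pc: int) -> None:
        self.bits[self._index(load_pc)] = True

    def is_dependent(self, load_pc: int) -> bool:
        return self.bits[self._index(load_pc)]


def load_may_issue(table: WaitTable, load_pc: int,
                   older_stores_pending: int) -> bool:
    """A "dependent load" waits until all preceding stores have executed."""
    if table.is_dependent(load_pc):
        return older_stores_pending == 0
    return True
```

A load that has never violated issues freely; once a violation is recorded for its PC, it is held back while any older store is still pending.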
StoreSet [3] is another technique for predicting memory dependence. When a memory RAW violation occurs, it records the PC of the store instruction in association with the PC of the dependent load instruction. When a load instruction with the same PC arrives, it predicts that the same store instruction will be the parent. It is reported that StoreSet can predict memory dependence with very high accuracy, although it needs some additional tables.
In the clustered microarchitecture, such a dependence chain of memory instructions is also affected by the delay of intercluster communication.
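The behavior described above can be sketched as follows. This is a simplified model and not the StoreSet design of [3]: it keeps a plain map from load PC to store PCs, and it assumes that instruction tags increase in program order so that the largest tag identifies the youngest in-flight store. All class and method names are hypothetical.

```python
from typing import Dict, Optional, Set


class SimpleStoreSetPredictor:
    """Simplified PC-based parent-store predictor (illustrative only)."""

    def __init__(self) -> None:
        self.parents: Dict[int, Set[int]] = {}  # load PC -> store PCs
        self.in_flight: Dict[int, int] = {}     # store PC -> youngest tag

    def on_violation(self, load_pc: int, store_pc: int) -> None:
        # Learn the pair when a memory RAW violation is detected.
        self.parents.setdefault(load_pc, set()).add(store_pc)

    def on_store_dispatch(self, store_pc: int, tag: int) -> None:
        self.in_flight[store_pc] = tag

    def on_store_complete(self, store_pc: int, tag: int) -> None:
        # Only clear the entry if no younger instance has replaced it.
        if self.in_flight.get(store_pc) == tag:
            del self.in_flight[store_pc]

    def predict_parent(self, load_pc: int) -> Optional[int]:
        """Return the tag of the predicted parent store, or None."""
        tags = [self.in_flight[pc]
                for pc in self.parents.get(load_pc, ())
                if pc in self.in_flight]
        # Assumption: tags grow in program order, so max() is the youngest.
        return max(tags) if tags else None
```

The returned tag is what a steering or scheduling stage would treat as the load's memory parent; a result of `None` means no dependence is predicted.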
Additional Logic to the Steering
It is desirable that no intercluster communication latency is inserted in the tag forwarding of the STORE, LOAD, and CONSUMER instructions. However, conventional register-aware steering schemes do not attempt to steer STORE and LOAD to the same cluster even if LOAD and CONSUMER can be steered there.
In this section, we examine a scheme that steers these STORE, LOAD, and CONSUMER instructions in the same cluster by using memory dependence prediction. In our "Memory Dependence Steering," the instruction that is predicted as the parent of memory dependence is also passed to the steering logic as one of the operands. Moreover, we examined which cluster should be prioritized when both the OP-cluster of memory and the OP-cluster of register exist.
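A minimal sketch of this idea treats the cluster of the predicted parent STORE as one more OP-cluster candidate. The function name and the `prefer_memory` flag are assumptions introduced for illustration; the flag only models the choice of which OP-cluster is prioritized and does not state which choice performs better.

```python
from typing import List, Optional


def memory_dependence_steer(workloads: List[int],
                            reg_op_cluster: Optional[int],
                            mem_op_cluster: Optional[int],
                            prefer_memory: bool = True) -> int:
    """Choose a cluster using register and predicted memory producers.

    reg_op_cluster -- cluster producing a register operand, or None
    mem_op_cluster -- cluster of the predicted parent STORE, or None
    prefer_memory  -- hypothetical priority switch between the two
    """
    if prefer_memory:
        candidates = [mem_op_cluster, reg_op_cluster]
    else:
        candidates = [reg_op_cluster, mem_op_cluster]
    for cluster in candidates:
        if cluster is not None:
            return cluster
    # No producer known: fall back to the least-loaded cluster.
    return min(range(len(workloads)), key=workloads.__getitem__)
```

With both producers known, the flag decides the outcome; with only one known, that cluster is chosen regardless of the flag.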
Effect of the Memory Dependence Steering
The effect of Memory Dependence Steering varies with the settings of the clustered microarchitecture, such as the accuracy of the memory dependence prediction, the memory reordering scheme, and the latency of the cache. We therefore examined the effect by varying each of these settings.
Figure 2 shows the effect of Memory Dependence Steering on IPC. By varying the quality of memory dependence information and the D1 access latency, we compared Memory Dependence Steering with the "Local Distance" scheme and ultimate steering. Ultimate steering is an oracle model that chooses the cluster that allows the earliest issue. The graph consistently shows that performance can be enhanced by using memory dependence information for steering. It also shows that an effect equivalent to that obtained with the oracle model is achieved even with a feasible predictor.
Distributed Speculative Memory Forwarding
This section describes the details of "Distributed Speculative Memory Forwarding," a technique that allows smooth execution of the succeeding instructions regardless of the cache access latency. Figure 3 shows examples of the D1 cache structure. A centralized D1 model (Figure 3(a)) can simplify cache control; however, it increases the number of access cycles under the clock rate accelerated by clustering. Moreover, the access latency increases because the cache port must be arbitrated via intercluster networks. Such a centralized D1 design may negate the performance improvement from clustering because IPC tends to decrease by about 5% when the D1 hit latency increases by one cycle.
Cache Designs for Clustered Microarchitectures
The number of access cycles of a cache can be reduced with a partitioned cache structure (Figure 3(b)). Access to each cache array remains fast, and port arbitration is performed locally in a cluster. We can roughly consider two partitioned cache designs: (1) replicated caches and (2) banked caches. Both these designs have their merits and demerits. The effective cache capacity is reduced in the replicated design (this implies that the number of D1 misses will increase); on the other hand, in the banked design, whether an access hits or misses in D1 may change depending on the steering. In both cases, frequent cache misses and variable access latency increase the penalty of scheduler replay because the scheduler has to predict the cache access latency before the LOAD instruction is issued.
Outline of the DSMF
In DSMF, the value of the STORE instruction is forwarded directly to the CONSUMER instruction as an operand. This forwarding becomes possible by setting the dependence between the STORE and CONSUMER instructions according to the memory dependence prediction. DSMF can cooperate with conventional sophisticated steering and scheduling techniques without major modifications because it is implemented by rewriting the source operand tag of the CONSUMER instruction in the front-end. The STORE and CONSUMER instructions are assigned to the same cluster according to the rewritten dependence, and the scheduler issues the CONSUMER instruction following the STORE instruction. The forwarding from the STORE instruction is carried out in the clusterized backend. A unit for forwarding STORE data, called the "Local Forward Buffer" (LFB), is added to each cluster.
The CONSUMER instruction can advance execution without waiting for the completion of the LOAD instruction by reading the value that the STORE instruction left in the LFB. On the other hand, a normal cache access is performed by the LOAD instruction, and the speculation is verified.
When DSMF is applied, the STORE and CONSUMER instructions are executed continuously, and the latency of the LOAD instruction (cache access latency) is concealed.
Implementation of DSMF
Preparation for Forwarding
In DSMF, the following operations are added to the front-end processing of the STORE, LOAD, and CONSUMER instructions.
In the front-end processing of the STORE instruction, the instruction's tag is recorded in the memory dependence predictor in association with the PC of the STORE. The memory dependence predictor is similar to the conventional StoreSet predictor [3]. The flag for forwarding is set on the STORE instruction when the memory dependence predictor holds an entry for this STORE.
In the front-end processing of the LOAD instruction, the memory dependence predictor is accessed in association with the PC of the LOAD instruction. The tag and assigned cluster of the STORE instruction are obtained if the predictor hits. This information, which is used by the CONSUMER instruction, is stored in a table called the "Parent Destination Table" (PDT). The PDT is built with a CAM associated with the LOAD instruction's tag; each entry contains the tag and assigned cluster of the parent STORE instruction. Further, the flag for forwarding is set on the LOAD instruction when the memory dependence predictor hits.
In the front-end processing of the CONSUMER instruction, the PDT is accessed in association with the tag of the operand. The tag and the assigned cluster are obtained if the PDT hits. At the same time, a table called the "Miss History Table" (MHT) is accessed in association with the PC of the CONSUMER instruction. The MHT is built with a RAM and functions as a confidence counter. The value "1" is set in the associated entry when misforwarding occurs due to misprediction. When the PDT hits and the value obtained from the MHT is 0, DSMF is applied to the CONSUMER instruction.
When DSMF is applied, the CONSUMER instruction is prepared in the front-end as follows. First, the source operand (tag of the LOAD instruction) is rewritten to the tag of the STORE instruction. The steering logic assigns this CONSUMER instruction to the same cluster as the STORE instruction by this rewriting. Next, the flag for forwarding is set to the CONSUMER instruction.
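The front-end steps above can be sketched as follows. This is an illustrative model, not the hardware design: the PDT and MHT are represented by Python dictionaries rather than a CAM and a RAM, the cluster is assigned directly instead of through the steering logic, and all class, field, and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Consumer:
    pc: int
    src_tag: int                   # initially the tag of the LOAD
    forward: bool = False          # flag for forwarding
    cluster: Optional[int] = None  # cluster chosen by the rewrite, if any


class DsmfFrontEnd:
    def __init__(self) -> None:
        # PDT: LOAD tag -> (parent STORE tag, STORE's assigned cluster)
        self.pdt: Dict[int, Tuple[int, int]] = {}
        # MHT: CONSUMER PC -> 1 once misforwarding has occurred
        self.mht: Dict[int, int] = {}

    def on_load(self, load_tag: int,
                parent: Optional[Tuple[int, int]]) -> bool:
        """Record the predicted parent; return the LOAD's forwarding flag."""
        if parent is None:  # memory dependence predictor missed
            return False
        self.pdt[load_tag] = parent
        return True

    def on_misforward(self, consumer_pc: int) -> None:
        self.mht[consumer_pc] = 1

    def on_consumer(self, insn: Consumer) -> Consumer:
        hit = self.pdt.get(insn.src_tag)
        if hit is not None and self.mht.get(insn.pc, 0) == 0:
            store_tag, store_cluster = hit
            insn.src_tag = store_tag      # rewrite LOAD tag to STORE tag
            insn.cluster = store_cluster  # simplification of the steering step
            insn.forward = True
        return insn
```

A CONSUMER whose operand tag hits in the PDT is redirected to the STORE; after `on_misforward` is called for its PC, the same CONSUMER keeps its original LOAD tag.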
Completion of Forwarding
To pass the value from the STORE to the CONSUMER locally in the cluster, a small "Local Forward Buffer" is added to the data path of each cluster. The LFB is built with a CAM, and it contains the STORE value associated with the STORE tag. When DSMF is applied, the CONSUMER instruction can obtain the operand directly from the LFB.
The following processing is performed when the STORE, LOAD, and CONSUMER instructions that hold the forwarding tag are executed.
The STORE instruction writes the value to the LFB associated with the tag. This operation is performed in parallel with the address calculation after the STORE value is read from the register. The oldest entry in the LFB is overwritten when the LFB is full.
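The LFB behavior can be sketched as a small tag-indexed buffer that overwrites its oldest entry when full. This is a software illustration only: an ordered dictionary stands in for the CAM, and the class name, method names, and default capacity are assumptions.

```python
from collections import OrderedDict
from typing import Optional


class LocalForwardBuffer:
    """Per-cluster buffer mapping STORE tag -> STORE value."""

    def __init__(self, capacity: int = 64):  # capacity is an assumption
        self.capacity = capacity
        self.entries: "OrderedDict[int, int]" = OrderedDict()

    def write(self, store_tag: int, value: int) -> None:
        """Called by the STORE after its value is read from the register."""
        if store_tag in self.entries:
            del self.entries[store_tag]        # refresh insertion order
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # overwrite the oldest entry
        self.entries[store_tag] = value

    def read(self, store_tag: int) -> Optional[int]:
        """Called by the CONSUMER at register read; None if evicted."""
        return self.entries.get(store_tag)
```

With a capacity of two, writing tags 1, 2, and 3 in that order evicts tag 1, so a later `read(1)` returns `None` while tags 2 and 3 remain available.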
After the completion of the normal cache access, the LOAD instruction writes the result to the queue for verification. An entry of the queue holds the STORE tag and the result of the LOAD. The queue is read at a rate of one entry per cycle, and the speculation is verified.
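The verification step can be sketched as a FIFO drained at one entry per cycle. The comparison rule is an assumption: the text does not specify how the check is made, so this sketch compares the LOAD result against the value that was forwarded for the same STORE tag and treats a missing forwarded value as a mismatch. All names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class VerificationQueue:
    def __init__(self, forwarded: Dict[int, int]) -> None:
        # forwarded: STORE tag -> value handed to the CONSUMER (assumed input)
        self.forwarded = forwarded
        self.queue: Deque[Tuple[int, int]] = deque()

    def push(self, store_tag: int, load_result: int) -> None:
        """Called by the LOAD after its normal cache access completes."""
        self.queue.append((store_tag, load_result))

    def step(self) -> Optional[bool]:
        """Verify one entry per cycle.

        Returns None if the queue is empty, True if the speculation was
        correct, and False if misforwarding is detected.
        """
        if not self.queue:
            return None
        store_tag, load_result = self.queue.popleft()
        return self.forwarded.get(store_tag) == load_result
```

A `False` result is the point at which a recovery action and an MHT update for the CONSUMER would be triggered in a full model.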
The CONSUMER instruction reads the value from the LFB at the time of register reading. The value associated with the STORE tag is obtained and used as the source operand of the CONSUMER.
Figure 4 shows the application rate of the proposed technique. Varying the size of the LFB, we examined how often the CONSUMER could obtain the operand from the LFB. The application rate grows with the size of the LFB, and it saturates at around 64 entries. We ascertained that the application rate is around 25% of all the CONSUMER instructions.
Evaluation
Related Works
Research on memory dependence prediction is discussed in [3], and speculative memory forwarding techniques in conventional centralized microarchitectures are discussed in [9, 7].
Palacharla et al. [8] showed that increasing the execution width will increase the effect of wire delay in advanced process technologies. They also discussed the effectiveness of clustered microarchitectures. The DEC Alpha 21264 processor [6] has a clustered data path that consists of two "subclusters." Canal et al. [2] proposed a clustered superscalar architecture that maintains binary compatibility by adding dynamic instruction assignment logic (steering) to the rename stage.
Among the studies of cache structures for clustered microarchitectures, Zyuban et al. [10] proposed a cache partitioning scheme that divides the caches by memory address. Recently, Gonzalez et al. [4] evaluated several cache partitioning structures and steering schemes for load instructions.
Conclusion
In this paper, we discussed the optimization of the memory referencing process in a clustered microarchitecture. In order to optimize this process and reduce communication between clusters, it is necessary to know the memory dependence in advance. Only the application of a prediction technique can actually provide information on memory dependence. We proposed an optimization technique based on a memory dependence predictor, and examined the effect of such a speculative optimization.
The proposed technique, DSMF, is superior to simple cache partitioning methods in terms of steering and scheduling. We obtained the experimental result that DSMF is applied to about 30% of all CONSUMER instructions although DSMF is limited by the accuracy of the memory dependence predictor. We also showed that an IPC improvement of 15% was obtained by this localization on the assumed baseline processor. Further optimization of the steering scheme can be carried out when DSMF is introduced.
The research presented in this paper reveals a new effect of localized memory dependence on clusters.
