Abstract-Multicore chips are currently dominating the microprocessor market as designs that improve performance and sustain power consumption. However, complex core features must still be considered to provide good performance for existing sequential applications. An effective approach to reduce core complexity without dramatically sacrificing performance is to distribute critical processor structures by using clustered microarchitectures. In these designs, communication latency among clusters is a critical performance bottleneck, and a good steering algorithm is required to reduce intercluster communication. In this paper, we propose a new energy-efficient microarchitectural approach that reduces intercluster communication by detecting and generating independent chains of instructions, referred to as subtraces, from the execution of sequential programs. The devised mechanism has been modeled on an x86-based trace-cache processor, where subtraces are built in the fill unit, stored in a trace cache, and individually steered to different clusters. Experimental results show that the proposal reaches performance speedups around 7 and 15 percent for point-to-point and bus-based interconnects, respectively, while achieving energy savings of up to 12 percent.
DURING nearly the last two decades, the performance of superscalar microprocessors has been improved by exploiting ILP through increasingly aggressive mechanisms that maintain binary compatibility. However, continuous shrinking of the transistor size causes an increase of power density, while performance does not rise at the same pace. The microprocessor industry has moved to multicore processors in order to trade off power consumption and global performance. In these processors, the number of cores, as well as their individual complexity, widely differs among industry products, which range from simple in-order execution cores [1] to complex out-of-order execution processors implementing simultaneous multithreading [2].
Though most recent multicore-related research focuses on a large number of cores, industry is still providing products with a rather low number of cores. The main reason for this situation is the limited software and operating system scalability, as well as the fact that most current applications are designed with traditional sequential programming techniques.
This work focuses on a new microarchitectural approach to extract parallelism at runtime from sequential applications. To this end, we concentrate on clustered microarchitectures, which were proposed to reduce the complexity of nonscalable structures in superscalar processors [3]. In these architectures, each cluster contains its own instruction queue (IQ), register file, and functional units, whereas a common processor front end (fetch, decode, and renaming logic) is shared among them. Since global complexity is reduced, a clustered architecture is suitable both for monolithic and multicore processors.
After an instruction is renamed, a steering algorithm decides the target cluster for that instruction. Then, if a value is consumed by a cluster other than its producer, a copy instruction is artificially generated by the steering logic, and inserted into the ROB and the issue queue of the producer cluster. Copy instructions are issued to a network connecting all clusters, and their execution time depends on the interconnect architecture and the distance between the source and destination clusters. The intercluster communication latency has a critical impact on global performance [4], and thus, keeping the number of copy instructions as low as possible becomes a major design concern. While sophisticated steering algorithms have been designed for this aim, this bottleneck, as shown in this paper, can still be further reduced.
To this end, our proposal first aims at dynamically generating independent chains of instructions (subtraces) out of traces of sequential code, which are then steered to different clusters. Subtraces are generated by analyzing and splitting a sequence of committed instructions. Then, individual instructions are replicated in several subtraces, until the subtraces become completely independent from each other. In this way, additional parallelism is artificially induced as long as it helps further alleviate the intercluster communication bottleneck. The proposed mechanism has been evaluated on top of a clustered trace-cache x86 microprocessor model, where the trace cache fill unit has been tailored to detect and construct independent subtraces, after the commit stage and out of the critical path. This information is then reused by the steering logic, which might insert instruction replicas into several clusters. Experimental results show a considerable reduction of copy instructions, which leads to average performance speedups between 3 and 15 percent for different bus-based interconnects, and between 3 and 7 percent for the evaluated point-to-point networks, while still reducing the global dissipated energy by up to 12 percent.
The remainder of this paper is organized as follows. Section 2 describes the baseline clustered architecture. Section 3 introduces the proposed algorithm, whose hardware implementation is described in Section 4. Sections 5 and 6 show an experimental evaluation of performance and power, respectively, and Sections 7 and 8 present some related work and concluding remarks.
MICROARCHITECTURE OVERVIEW
This section presents the baseline clustered architecture used to implement and evaluate our proposal. To guarantee that our reported performance gains add to those of existing, common architectural improvements, a sophisticated baseline design has been modeled, using a trace cache, an advanced steering algorithm, and nontrivial interconnection network topologies. These features have been proposed in previous research, and are summarized in this section to aid in the understanding of the rest of the paper.
The baseline architecture is a superscalar, single-threaded, clustered processor, whose block diagram is represented in Fig. 1. The processor front end fetches x86 macroinstructions, decodes them, and dumps the generated microinstructions (uops) into a uop queue (Fig. 1a). Then, uops are dispatched into a shared ROB, which tracks their global program order until they commit. Memory uops reserve an entry in a global LSQ, while the rest of them are steered to the clusters and reserve a local IQ entry. The LSQ implements the load bypassing and load forwarding optimization techniques, and its shared design guarantees a global ordering of memory accesses. When a nonspeculative uop at the ROB head completes, it is processed by the fill unit. This component builds a temporary trace that is eventually sent back to the trace cache.
After a uop is dispatched, a steering algorithm decides its target cluster. When an arithmetic uop is steered into a cluster lacking any input operand, a copy uop is generated and inserted into the IQ of one of the clusters containing the required operand. A shared register alias table (RAT) stores the register mappings for each cluster; it is indexed by a logical register identifier, and returns the private physical register associated in each cluster, or a void label if the value is not present for that cluster. Finally, each cluster contains a private IQ, functional unit pool, and physical register file (Fig. 1b).
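As an illustration only, the shared RAT lookup and the resulting copy requests can be sketched in Python as follows. The names `ClusterRAT` and `copies_needed` are hypothetical, and picking the first cluster that holds the operand as the copy source is a simplifying assumption; in the modeled machine this choice is made by the steering algorithm.

```python
VOID = None  # assumed encoding of the "void label" (value absent in a cluster)


class ClusterRAT:
    """Hypothetical shared RAT: one physical register (or VOID) per cluster."""

    def __init__(self, num_clusters, num_logical_regs):
        self.table = [[VOID] * num_clusters for _ in range(num_logical_regs)]

    def lookup(self, lreg):
        # Returns the per-cluster mapping of a logical register.
        return self.table[lreg]

    def map(self, lreg, cluster, preg):
        self.table[lreg][cluster] = preg


def copies_needed(rat, src_lregs, target_cluster):
    """List (logical register, source cluster) pairs that require a copy uop."""
    copies = []
    for lreg in src_lregs:
        mapping = rat.lookup(lreg)
        if mapping[target_cluster] is VOID:
            holders = [c for c, preg in enumerate(mapping) if preg is not VOID]
            if holders:
                # Assumption: copy from the first cluster holding the value.
                copies.append((lreg, holders[0]))
    return copies


rat = ClusterRAT(num_clusters=2, num_logical_regs=4)
rat.map(0, cluster=0, preg=5)   # logical r0 lives only in cluster 0
rat.map(1, cluster=1, preg=7)   # logical r1 lives only in cluster 1
print(copies_needed(rat, [0, 1], target_cluster=1))
```

In this example, a uop steered to cluster 1 that reads both registers needs a single copy, for the operand held only by cluster 0.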
The processor front end of the baseline architecture uses both a trace cache and an instruction cache (Fig. 1c). The trace cache contains sequences of predecoded uops, and on a hit, the contents of the accessed line are directly copied into the uop queue. The trace cache is looked up in parallel with the instruction cache. On a trace cache miss, the fetched instruction cache line is decoded, and the generated uops are then inserted into the uop queue. A successful access to the trace cache has two main benefits: on one hand, the decode latency is avoided; on the other hand, a taken branch does not prevent subsequent uops from being fetched in the same cycle. Thus, a trace-cache-based front end is an effective solution to increase fetch bandwidth, especially in architectures implementing a CISC instruction set, like the x86 ISA.
Trace Cache
The trace cache is indexed by the program counter (eip register) and a sequence of bits representing the predicted behavior of the next branches. On a hit, the trace cache returns a sequence of predecoded uops that can be directly dispatched without further processing. This model relies on multiple branch direction prediction, that is, a branch predictor capable of providing several predictions in a single cycle, each based on the original program counter and the previous prediction [5].
The trace cache is organized as a set-associative structure storing trace lines. The implementation used throughout the experiments is based on the original proposal by Rotenberg et al. [6], where the fields of each trace cache line are listed in Table 1. Traces of nonspeculative uops are stored after the commit stage in the fill unit, which is a temporary buffer of the same size as the maximum trace size. When a trace is full, a new trace line is allocated, replaced, or updated in the trace cache, and the contents of the fill unit are copied into it. It is possible to obtain final traces smaller than the maximum trace size for the following reasons: on one hand, the number of branches within a trace is limited (mainly by the multiple branch predictor width); on the other hand, uops belonging to the same x86 macroinstruction are not allowed to span several traces. Thus, the fill unit might be drained before its maximum occupancy is reached.
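The trace termination conditions described above can be illustrated with a minimal Python sketch. The function `split_into_traces` and the default limits (16 uops, 3 branches) are assumptions made for the example, each uop is modeled as a hypothetical `(macro_id, is_branch)` pair with a unique `macro_id` per dynamic macroinstruction, and no macroinstruction is assumed to exceed the maximum trace size.

```python
from itertools import groupby


def split_into_traces(uops, max_uops=16, max_branches=3):
    """Split committed uops into traces, draining the fill unit early when needed.

    uops: list of (macro_id, is_branch) tuples in commit order.
    """
    traces, current, branches = [], [], 0
    # Consecutive uops with the same macro_id belong to one macroinstruction.
    for _, group in groupby(uops, key=lambda u: u[0]):
        group = list(group)
        # A macroinstruction may not span two traces: drain before overflowing.
        if current and len(current) + len(group) > max_uops:
            traces.append(current)
            current, branches = [], 0
        current.extend(group)
        branches += sum(1 for _, is_branch in group if is_branch)
        # Drain when the trace is full or the branch limit is reached.
        if len(current) >= max_uops or branches >= max_branches:
            traces.append(current)
            current, branches = [], 0
    if current:
        traces.append(current)
    return traces


# A 3-uop macroinstruction followed by a 2-uop one, with room for only 4 uops.
stream = [(0, False)] * 3 + [(1, False)] * 2
print([len(t) for t in split_into_traces(stream, max_uops=4)])
```

With a 4-uop limit, the second macroinstruction does not fit after the first one, so the first trace is stored with only three uops.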
Steering Algorithm
A major design issue of a clustered microarchitecture is the steering algorithm. After dispatching arithmetic instructions, the steering algorithm decides which cluster they are sent to, aiming at trading off workload balance among clusters against the intercluster communication needed to satisfy input dependences. Likewise, the steering algorithm decides from which cluster to copy a uop's input operand in case it is not present. In the modeled clustered architectures, a slightly modified version of the topology-aware steering algorithm [7] is implemented, which works as follows.
Before steering a uop, an initial set of candidates is created including all clusters. This set is progressively reduced by applying four successive filters and discarding inadequate candidates, and finally a random cluster is selected among the resulting set. The actions performed by each filter are listed in Table 2. Filters 1 and 4 aim at balancing the workload among clusters, while filters 2 and 3 try to reduce the communication latency due to absent input operands.
Regarding the workload balance, each cluster tracks the number of instructions dispatched to it ($di_i$), and the average of these counters is updated globally ($di_{avg}$). Then, each cluster computes a private workload counter as the deviation of dispatched instructions $di_i - di_{avg}$, and a global imbalance counter is calculated as the maximum absolute value of the workload counters. Filter 1 in Table 2 is applied when imbalance exceeds a threshold, which has been empirically set as the number of clusters multiplied by 8. Regarding uop input dependences, an operand is said to be present in a cluster when there is a physical register associated with it. The operand is said to be available when it is present and the instruction producing its value has completed.
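The workload and imbalance counters defined above can be written down as a short Python sketch. The function names are hypothetical, and only the condition that triggers Filter 1 is modeled, not the action the filter performs.

```python
def workload_imbalance(dispatched):
    """Return the per-cluster workload counters and the global imbalance.

    dispatched: list with the number of uops dispatched to each cluster (di_i).
    """
    di_avg = sum(dispatched) / len(dispatched)
    workload = [di - di_avg for di in dispatched]      # di_i - di_avg
    imbalance = max(abs(w) for w in workload)          # maximum absolute deviation
    return workload, imbalance


def balance_filter_active(dispatched, per_cluster_threshold=8):
    """True when imbalance exceeds (number of clusters) * 8, as stated in the text."""
    _, imbalance = workload_imbalance(dispatched)
    return imbalance > per_cluster_threshold * len(dispatched)


# Four clusters: the threshold is 4 * 8 = 32.
print(balance_filter_active([100, 60, 60, 60]))   # deviation of 30, below threshold
print(balance_filter_active([110, 60, 60, 60]))   # deviation of 37.5, above threshold
```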
Notice that topology-aware steering is a sophisticated algorithm that effectively lowers the communication among clusters, while still keeping a fair workload balance. Since the technique proposed in this paper aims at further reducing intercluster communication, we have opted for a powerful baseline steering algorithm for a fair comparison. Simpler, less effective steering algorithms would leave more room for communication reduction and thus yield higher speedups for our proposal.
Interconnection Network
Copy uops are generated in the dispatch stage when a logical register is not mapped in a cluster where it is consumed. These uops are handled as arithmetic instructions, except that they are scheduled to the interconnection network instead of a functional unit, and their source and destination physical registers are placed in different clusters. The interconnect can be viewed as a black box whose interface is formed by an injection/ejection link connected to each cluster. Once in the instruction queue of the source cluster, a copy uop is woken up when the source register is ready and the injection link is available. The interconnection network will forward the packet to the destination cluster, and eject the contents into its register file through a dedicated write port.
In general, the communication latency of a message depends on the network topology, routing algorithm, and switching mechanism. Since intercluster communication occurs on chip, links can be fairly assumed as wide as the message size, which includes a 32-bit value, a 3-bit target cluster identifier (for an 8-cluster architecture), and a 6-bit target register identifier (for a 64-entry register file). As messages are very short, packet switching is used. The communication latency can be computed as $lat_0 + lat_{cont} = (T_r + T_l)\,d + lat_{cont}$, where $lat_0$ is the zero-load latency or the time required to forward the message to the destination without contention, $lat_{cont}$ is the total contention delay due to packet collisions, $T_r$ is the routing and forwarding time, $T_l$ is the link transmission time (including its arbitration when it is shared), and $d$ is the number of traversed links.
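As a quick check of the figures above, the message width and the latency expression can be evaluated with the following Python sketch. The function names are hypothetical, and the contention term is simply passed in as a parameter, since in the paper it is obtained through simulation.

```python
def message_bits(value_bits=32, num_clusters=8, regfile_entries=64):
    """Width of an intercluster message: value + cluster id + register id."""
    cluster_id_bits = (num_clusters - 1).bit_length()       # 3 bits for 8 clusters
    register_id_bits = (regfile_entries - 1).bit_length()   # 6 bits for 64 entries
    return value_bits + cluster_id_bits + register_id_bits


def comm_latency(t_r, t_l, d, lat_cont=0):
    """lat_0 + lat_cont, with lat_0 = (T_r + T_l) * d."""
    return (t_r + t_l) * d + lat_cont


print(message_bits())                       # 32 + 3 + 6 = 41 bits
print(comm_latency(t_r=1, t_l=1, d=4))      # zero-load latency over 4 links
```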
The $lat_{cont}$ component is an unpredictable value computed through simulation, while $lat_0$ depends on the network topology. Fig. 2 represents the block diagrams of the evaluated topologies for a 4-cluster processor.
- In a bus topology ( , and $T_l$ depends on the number of connected nodes. Again, two, four, and eight cycles are considered for $lat_0$.
- The 2D mesh (Fig. 2c) requires a router attached to each cluster, which applies the XY routing algorithm after a delay of $T_r$ cycles. The link delay ($T_l$) can be very low, since only point-to-point connections are used. In the largest evaluated mesh (4 × 2 for eight clusters), $d$ takes a maximum value of 4 hops. In this case, $2d$ and $4d$ cycles are considered for $lat_0$, which corresponds to $T_r + T_l = 2$ and $T_r + T_l = 4$, respectively. When connecting three, five, and seven clusters with a mesh topology, 2 × 2, 3 × 2, and 4 × 2 meshes are assumed, respectively, where one of the end links is left disconnected in each case.
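The hop counts quoted for the mesh can be reproduced with a small Python sketch of XY routing. The row-major numbering of clusters and the function names are assumptions made for the example.

```python
def xy_hops(src, dst, cols):
    """Links traversed by XY routing between two clusters of a 2D mesh.

    Clusters are assumed to be numbered row by row (row-major order).
    """
    src_x, src_y = src % cols, src // cols
    dst_x, dst_y = dst % cols, dst // cols
    # XY routing travels along the X dimension first, then along Y.
    return abs(src_x - dst_x) + abs(src_y - dst_y)


def max_hops(cols, rows):
    """Largest value of d over all source/destination pairs."""
    nodes = range(cols * rows)
    return max(xy_hops(s, t, cols) for s in nodes for t in nodes)


d = max_hops(cols=4, rows=2)
print(d)               # maximum d in the 4 x 2 mesh
print(2 * d, 4 * d)    # worst-case lat_0 for T_r + T_l = 2 and T_r + T_l = 4
```

For the 4 × 2 mesh the maximum distance is 4 hops, in agreement with the text, which gives worst-case zero-load latencies of 8 and 16 cycles for the two evaluated configurations.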
corresponding to a subtrace, whose superposition contains all arcs and vertices of the original graph. Subtraces are generated in such a way that all input dependences are satisfied for each instruction, at the expense of possibly executing some instructions in both subtraces.
Proposed Algorithm
The proposed algorithm consists of two phases implemented in the fill unit, which is placed after the commit stage in a superscalar processor pipeline, and out of the critical path.
A trace $Tr$ of $N$ committed instructions is split into $c$ independent subtraces ($STr_0$, $STr_1$, etc.) of maximum size $N$, where $c$ is the number of clusters. Fig. 4 represents an example using two clusters and the flow of instructions shown in Fig. 3.
- Phase 1. Selection of subtrace. As shown in Fig. 4a, instructions from trace $Tr$ are first split individually among subtraces $STr_0$ and $STr_1$. When a uop commits, it is assigned to the subtrace containing its input operands. If the input operands are contained in different subtraces, or no subtrace contains them, the least loaded subtrace is chosen. Likewise, if a given imbalance among subtraces is reached, the least loaded subtrace is chosen and the presence of input operands is ignored. As in the steering algorithm, the subtrace imbalance counter is based on subtrace length deviations (see Section 2.2), and the optimal subtrace imbalance threshold has been found to be equal to 8 in this case.
- Phase 2. Satisfying dependences. After committing $N$ instructions, subtraces are processed so as to make them independent from each other. To this end, the input dependences of each instruction are satisfied by inserting their producers into the same subtrace, which might cause producer instructions to be replicated in several subtraces. As shown in Fig. 4b, subtraces are processed in this phase from left to right, i.e., from the youngest to the oldest instruction.
Thus, each new instruction placed in subtrace $STr_x$ will cause all its older producers to be recursively inserted into $STr_x$ as well. In the example, the processing of instruction F makes A be inserted into $STr_1$, and the processing of C makes its producer B be inserted into $STr_0$. Notice that memory instructions are excluded from replications, since the LSQ is a global structure. For simplicity, load and store instructions are treated in the same way as arithmetic instructions in the subtrace generation process and the subtrace storage in the trace cache, but the replication information attached to them is ignored after they are dispatched. One single copy of every memory instruction is inserted into the global LSQ, and thus no extra pressure is incurred on the memory hierarchy when the number of clusters or the instruction replication increases.
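The two phases can be summarized in a minimal Python sketch. It is an illustration under stated assumptions rather than the evaluated implementation: the function names are hypothetical, the six-instruction trace is made up and is not the one in Fig. 3, operands produced outside the trace are treated as contained in no subtrace, and the special handling of memory instructions is left out.

```python
def assign_subtraces(deps, num_clusters, threshold=8):
    """Phase 1: assign each instruction to exactly one subtrace.

    deps[i] lists the in-trace producers (indices < i) of instruction i.
    """
    assigned, loads = [], [0] * num_clusters
    for producers in deps:
        avg = sum(loads) / num_clusters
        imbalance = max(abs(load - avg) for load in loads)
        homes = {assigned[p] for p in producers}   # subtraces holding the operands
        if imbalance <= threshold and len(homes) == 1:
            target = homes.pop()                   # all operands in one subtrace
        else:
            target = loads.index(min(loads))       # least loaded subtrace
        assigned.append(target)
        loads[target] += 1
    return assigned


def make_independent(deps, assigned):
    """Phase 2: replicate producers, walking from youngest to oldest."""
    bitmaps = [1 << s for s in assigned]           # one presence bit per subtrace
    for i in range(len(deps) - 1, -1, -1):
        for p in deps[i]:
            bitmaps[p] |= bitmaps[i]               # producer joins its consumer's subtraces
    return bitmaps


names = ["A", "B", "C", "D", "E", "F"]
deps = [[], [], [0, 1], [1], [2], [0, 3]]          # e.g., C consumes A and B
assigned = assign_subtraces(deps, num_clusters=2)
bitmaps = make_independent(deps, assigned)
subtraces = [[n for n, b in zip(names, bitmaps) if (b >> s) & 1] for s in range(2)]
print(subtraces)   # [['A', 'B', 'C', 'E'], ['A', 'B', 'D', 'F']]
```

Because producers are always older than their consumers, a single pass from the youngest to the oldest instruction is enough to propagate replication recursively. In this made-up trace, A and B end up replicated in both subtraces.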
HARDWARE IMPLEMENTATION

Instruction Numbering
To identify instruction dependences after the commit stage, an instruction numbering mechanism is proposed, which labels committed instructions with continuous identifiers as per program order, and stores for each instruction the identifiers of its input dependences. These identifiers are called sequence numbers and are denoted as $I_{seq}$ or $D_{seq}$, where $I$ is an instruction and $D$ is one of its input dependences. Fig. 5 represents the process of sequence number assignment. Each ROB entry contains three sequence numbers, one for the contained instruction and two for its input dependences. Likewise, each register file entry holds the sequence number of the instruction that produced the associated value. A global counter seqctr is increased and assigned to $I_{seq}$ when instruction $I$ is dispatched. When $I$ is issued, $D1_{seq}$ and $D2_{seq}$ are read from the register file jointly with the source operand values, and $I_{seq}$ is written into the destination physical register when the operation finishes.
When $I$ commits, all sequence numbers are available to be used in the fill unit. If $I$ is a mispredicted branch, the content of the ROB is squashed, and seqctr is set to $I_{seq} + 1$. By assigning sequence numbers with this algorithm, a continuous range of instruction sequence numbers is guaranteed, which has the convenient property of providing a direct correspondence between instructions and their position within the fill unit (thus avoiding associative ports for searches).
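A minimal Python sketch of this numbering scheme is given below. The class and method names are hypothetical. The sketch assumes that seqctr holds the next number to hand out (assign, then increment), which is the reading under which resetting it to $I_{seq} + 1$ after a misprediction keeps committed sequence numbers continuous.

```python
class SeqNumbering:
    """Hypothetical model of sequence number assignment and recovery."""

    def __init__(self, num_pregs):
        self.seqctr = 0
        self.preg_seq = [None] * num_pregs   # producer sequence number per physical register

    def dispatch(self):
        """Assign I_seq to a newly dispatched instruction."""
        i_seq = self.seqctr
        self.seqctr += 1
        return i_seq

    def issue(self, src_pregs):
        """Read D1_seq and D2_seq together with the source operands."""
        return [self.preg_seq[p] for p in src_pregs]

    def writeback(self, i_seq, dst_preg):
        """Record the producer of the destination physical register."""
        self.preg_seq[dst_preg] = i_seq

    def squash_after(self, branch_seq):
        """Mispredicted branch: wrong-path numbers are reused."""
        self.seqctr = branch_seq + 1


s = SeqNumbering(num_pregs=8)
a = s.dispatch()            # older instruction
b = s.dispatch()            # branch, later found to be mispredicted
wrong_path = s.dispatch()   # squashed, never commits
s.squash_after(b)
c = s.dispatch()            # first correct-path instruction after the branch
print(a, b, c)              # committed instructions are numbered continuously
```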
Fill Unit
In a trace cache processor, the fill unit is a hardware structure placed after the commit stage aimed at reconstructing a trace of committed uops and storing it in a new trace cache line. In the proposed architecture, the fill unit is additionally in charge of implementing the subtrace generation algorithm. This hardware structure consists of an N-entry buffer (N is the maximum trace length). Each buffer entry contains three fields: the associated instruction bits, a c-entry bitmap representing the presence of an instruction in each subtrace (c is the number of clusters), and the sequence numbers of the instruction and its input dependences.
The algorithm phases are implemented with two different operations in the fill unit, referred to as fill-up and emptying processes, respectively. In the former, committing instructions are assigned to one single subtrace each, while in the latter, subtraces are made independent by replicating instructions to satisfy the required input operands. Each process has an associated pointer, referred to as fillhead and emptyhead, initially pointing to buffer positions 0 and $N - 1$, respectively. Below, the algorithm implementation is described using the example shown in Fig. 6.
Initially, the fill unit starts the fill-up process with the first $N$ instructions extracted from the reorder buffer (ROB) at the commit stage (Fig. 6a). The initial trace is referred to as trace A, and associated instructions are represented as $I_A$. When $I_A$ commits, it is inserted into the fill unit at the position pointed to by fillhead, and this pointer is incremented. In the associated entry, the c-entry bitmap is updated by activating the bit corresponding to the initial subtrace assignment.
Once the buffer fills up (Fig. 6b), trace A is completely held in the fill unit. At this time, phase 1 of the subtrace generation algorithm is complete for trace A, and each entry's bitmap associates one instruction with one single subtrace. The emptying process is now activated for trace A, and the fill-up process starts for trace B, which is formed of the next $N$ instructions $I_B$ placed in the ROB. From now on, both the fillhead and emptyhead pointers are moved from right to left, i.e., decreased.
In a subsequent intermediate state (Fig. 6c), the fill-up process continues for trace B, in which instructions $I_B$ are taken from the ROB and placed into the fill unit. At the same time, the emptying process works on trace A, implementing phase 2 of the subtrace generation algorithm. For each emptied instruction $I_A$, the stored sequence numbers of its input dependences $D_A$ are looked up. Then, if the buffer entries pointed to by these sequence numbers are valid (i.e., are within the emptying range), instruction $D_A$ is replicated in all subtraces where its consumer $I_A$ is present. To this end, the c-entry bitmap of $D_A$ is updated by activating those bits which are set in $I_A$'s bitmap (bitwise OR operation). Each instruction extracted from the fill unit is sent to the trace cache, where the generated independent subtraces are stored.
When trace A finishes the emptying process (Fig. 6d), phase 2 of the subtrace generation algorithm is complete for this trace, and the $c$ generated independent subtraces are stored into the corresponding trace cache line. When the fill-up process for trace B also completes, the direction of the fillhead and emptyhead pointers is switched again. The fill unit starts the emptying process for trace B, and the instructions $I_C$ of trace C, waiting at the ROB head, begin to enter the fill unit from left to right.
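The interplay of the two pointers can be illustrated with the following simplified Python model of the fill unit. It is a sketch under several assumptions: the function name and the input format are hypothetical, one instruction is emptied and one is filled per step in lockstep, sequence numbers are unbounded, the initial subtrace of each instruction is given as input, and instructions left over in an incomplete trace are ignored.

```python
def run_fill_unit(stream, n):
    """Model the alternating fill-up and emptying processes of an n-entry fill unit.

    stream[seq] = (initial_subtrace, producer_seqs) for each committed instruction.
    Returns one list of (seq, bitmap) pairs per trace, in program order.
    """
    buf = [None] * n                              # entries: [seq, bitmap, producer_seqs]
    traces = []
    num_traces = len(stream) // n
    for t in range(num_traces + 1):               # one extra pass drains the last trace
        forward = t % 2 == 0                      # pointer direction flips for each trace
        positions = range(n) if forward else range(n - 1, -1, -1)
        prev_start = (t - 1) * n                  # first sequence number of the emptied trace
        emptied = []
        for k, pos in enumerate(positions):
            if t > 0:                             # emptying process, youngest first
                seq, bitmap, producers = buf[pos]
                for d in producers:
                    if d >= prev_start:           # producer lies within the emptying range
                        kd = d - prev_start
                        pd = n - 1 - kd if forward else kd
                        buf[pd][1] |= bitmap      # OR the consumer's bitmap into the producer
                emptied.append((seq, bitmap))
            if t < num_traces:                    # fill-up process reuses the freed entry
                seq = t * n + k
                subtrace, producers = stream[seq]
                buf[pos] = [seq, 1 << subtrace, producers]
        if emptied:
            traces.append(emptied[::-1])          # restore program order
    return traces


# Two 2-instruction traces; instruction 1 consumes 0, 2 consumes 1, 3 consumes 2.
stream = [(0, []), (1, [0]), (0, [1]), (1, [2])]
print(run_fill_unit(stream, n=2))
```

In this example, instruction 0 is replicated into both subtraces of the first trace. Instruction 2 consumes a value produced in the previous trace, which falls outside the emptying range and therefore causes no replication.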
EXPERIMENTAL EVALUATION
This section presents a performance evaluation of the proposed techniques. Experiments have been carried out on top of the Multi2Sim 2.2 simulation framework [8] , a cycle-accurate simulator for x86-based superscalar processors, modified to model a clustered architecture, intercluster network topologies, and independent subtraces generation. The simulator accurately tracks the processor pipeline state cycle by cycle, as well as the memory hierarchy including a trace cache. The parameters of the modeled machine are summarized in Table 3 .
The Mediabench [9] suite has been used as a workload to evaluate the devised techniques. These applications include image and video processing, audio encoding, and speech recognition, among others. The Mediabench suite provides a particular potential for dynamic extraction of parallelism, since it includes extremely parallel algorithms whose implementation is based on a traditional sequential programming model. This is in contrast to the SPEC CPU benchmarks, which do not provide high amounts of intrinsic parallelism, or to the SPLASH/Parsec benchmarks, which exploit their parallelism mostly at compile time using the pthreads programming model. The presented results include partial program executions, where simulations are stopped after the first 100 million uops commit.
Performance Evaluation
Fig. 7 shows a performance study of the proposed technique, including performance results for the baseline architecture (Fig. 7a), performance for a clustered processor with automatic generation of subtraces (Fig. 7b), and the resulting performance speedup (Fig. 7c). The number of clusters ranges from 2 to 8, and the evaluated intercluster networks are a bus (with two, four, and eight cycles for $lat_0$, see Section 2.3), a crossbar (or n-buses with two, four, and eight cycles for $lat_0$), and a mesh (with two and four cycles for $lat_0$). Each bar represents the average speedup for the whole benchmark suite (16 workloads).
- In a bus topology, the fastest configuration ($lat_0 = 2$) provides speedups below 5 percent. However, a fast bus is only suitable for a very low number of connected clusters. When the zero-load latency value is increased to 4, a 4-cluster configuration provides a speedup greater than 10 percent, while an 8-cycle latency value makes subtraces outperform the baseline machine by more than 15 percent for some configurations. When the communication latency increases, a reduction of copy instructions has a stronger benefit on performance.
- Crossbar topologies eliminate packet collisions when different pairs of clusters communicate, which improves performance of both the baseline and proposed designs. Since copy instructions do not incur such a high penalty, speedups decrease. However, the hardware cost of a crossbar grows quadratically with the number of clusters. Thus, this topology might be unfeasible for a high number of clusters.
- Finally, a mesh can afford shorter point-to-point link delays. The main communication latency occurs in this case when distant clusters communicate and need to traverse several routers. A mesh with a 4-cycle zero-load latency provides speedups above 5 percent, regardless of the number of clusters.
To provide some insight into each specific benchmark's behavior, performance has been evaluated individually for each benchmark for some specific configurations, as plotted in Fig. 8. Shorthand c2-bus2-base stands for two clusters/2-cycle bus/baseline machine; c8-bus8-subtr means eight clusters/8-cycle bus/machine with subtraces generation; etc. The bus latency values chosen for this plot are intended to be representative of the corresponding number of clusters in each configuration, and resemble those used in previous works on clustered microarchitectures [7].
Performance values are given as a global IPC, i.e., number of instructions per cycle committed from the global ROB. Notice that an increase in the number of clusters might result in an IPC loss in some cases (Figs. 7 and 8), which under identical circumstances would negate the reason for using more than one cluster. However, clustering the major microprocessor components (e.g., register files or IQs) helps reduce the clock cycle time, which should result in lower global execution time. The benefits of clustered microarchitectures have been widely studied in the literature and are out of the scope of this work. Here, we assume these benefits and focus on the additional performance gains obtained from subtraces.
Copy Instructions and Replicated uops
When independent subtraces are dispatched, the number of copy uops decreases, reducing communication and network contention. Instead, uop replicas are dispatched, which are usually faster (two cycles for very frequent integer additions or effective address computations), and cause dependent instructions in the IQ to be issued earlier. Fig. 9 shows the number of copy instructions (copy_base for the baseline machine and copy_subtr for the machine implementing independent subtraces) and the number of replicated uops, represented as an average fraction over the total number of committed uops.
These values have been obtained for a bus2 network, though the interconnect topology does not considerably affect the shape of the plotted curves. When a uop at the ROB head with several replicas commits, one of them is tracked as the original uop, and the rest of them are counted as replicas. Results show that the number of copy uops decreases by about 10 percent in any configuration. On the other hand, uop replicas range from 10 to 20 percent of the total committed uops, depending on the number of clusters.
The results presented in this section partially explain the performance speedups shown in Fig. 7. Since intercluster communication acts most of the time as a performance bottleneck, execution time is reduced according to the decrease in copy instructions. Thus, the points that show a larger distance between the copy_base and copy_subtr curves correspond to the highest performance speedups.
Impact of the Trace Size
We have measured the impact of different trace sizes, ranging from 8 to 64 uops. Fig. 10 shows the speedup achieved by the proposed technique executed on top of a 4-cluster processor model with the bus4 interconnect topology. For the sake of clarity, only three benchmarks are plotted, including the average curve for the whole Mediabench suite. The optimal trace size is observed at different positions for each benchmark, mostly depending on the parallelism exhibited by each.
The trace size has two main implications on performance. On one hand, a larger trace increases the chance of extracting existing parallelism within a given trace with a lower uop replication rate. On the other hand, a trace cache with the same number of larger cache lines makes the hit ratio shrink dramatically. In our benchmark suite, the average fraction of committed uops fetched from the trace cache is 57.9, 52.2, 41.7, and 27.3 percent for 8, 16, 32, and 64-entry traces, respectively. The Average plotted curve represents the whole Mediabench suite, and shows an optimal average trace size of 16 entries (used for our baseline machine).
Impact of the Sequence Number Length
During the emptying process implemented in the fill unit, each instruction $I$ is extracted and dumped into the trace cache. For each input dependence $D1$ and $D2$ of $I$ present in the fill unit, bitmaps are updated to determine their future presence in subtraces. The indices of $D1$ and $D2$ are determined with sequence numbers attached to $I$, whose length should be determined as a tradeoff between performance and hardware cost. If the sequence numbers are too short, they might generate aliasing by detecting false dependences. On the contrary, overly long sequence numbers increase the fill unit entry size. In any case, sequence number aliasing causes false dependence detections, but never leaves a real dependence undetected. Since the sequence number is used to index the $N$-entry fill unit, the sequence number size should not be lower than $\log_2 N$ bits.
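The aliasing effect can be illustrated with a short Python sketch. The validity check below is an assumption made for the example, since the exact hardware check is not detailed in the text: the producer is taken to be inside the current trace when the distance between the truncated sequence numbers does not exceed the consumer's offset within its trace. The function name is hypothetical.

```python
def producer_offset(i_seq, d_seq, i_offset, bits):
    """Locate a producer inside the current trace using truncated sequence numbers.

    i_offset is the position of the consumer within its trace (0 = oldest).
    Returns the producer's offset in the trace, or None if it is deemed outside.
    """
    mask = (1 << bits) - 1
    distance = ((i_seq & mask) - (d_seq & mask)) & mask   # distance modulo 2**bits
    if 1 <= distance <= i_offset:
        return i_offset - distance
    return None


# Consumer with sequence number 100, at offset 10 of its trace.
print(producer_offset(100, 97, 10, bits=5))   # real dependence, 3 instructions back
print(producer_offset(100, 63, 10, bits=5))   # producer 37 back: aliases to 5 back
print(producer_offset(100, 63, 10, bits=8))   # with 8 bits it is correctly rejected
```

A real in-trace dependence is always closer than $N$ instructions, so it is never missed as long as the sequence number has at least $\log_2 N$ bits. A distant producer, however, can wrap around and be mistaken for an in-trace one, which is the false dependence case.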
When false dependences are introduced, independent subtraces become larger than necessary, and redundant computations are performed across clusters. Fig. 11 shows the average subtrace size for the studied benchmarks, varying the number of clusters between two and eight, and using original traces of N = 32 uops. On one hand, a higher number of clusters (subtraces) yields a smaller average subtrace length. The reason is that traces can be split further into several independent subtraces, whose length is reduced by exploiting the ILP within the trace. On the other hand, the results show that a sequence number length of 8 bits almost completely eliminates false dependence detection, since the average subtrace length is very similar to that of an ideal design with an unbounded sequence number length.
POWER AND ENERGY STUDY
[...] memory hierarchy, which is based on the Cacti code [11] to compute statistics related to data arrays and CAM structures. McPAT provides detailed power consumption statistics for each hardware component, including subthreshold leakage, gate leakage, and runtime dynamic power, based on the hardware complexity of the computed designs and their access rates. The tool has been set up to model our baseline and proposed schemes for a 45 nm technology and a working frequency of 3.4 GHz, and it has been extended with the following features:
• A multicluster architecture model has been added to McPAT by replicating the structures that are private to each cluster (register files, instruction queues, functional units, and result broadcast buses). The hardware characteristics and access statistics of these components are provided by Multi2Sim and summed to obtain the total power consumption. The intercluster network has been modeled with the configurable NoC implementation provided by McPAT and included in the global power results.
• Relying on the Cacti code, two trace cache models have been added to McPAT: a baseline trace cache model with the cache line fields presented in Table 1, and the proposed trace cache model, where an additional bitmap is attached to every uop to indicate its presence in subtraces. Both trace cache models have a capacity of 256 traces, 4-way associativity, and a trace size of 16 uops. We have made sure that the incurred hardware overhead does not impact the global processor cycle time.
• Likewise, Cacti has been used to obtain two fill unit models, which accept as many instructions per cycle from the ROB as the commit width w. The baseline fill unit has been modeled as a 16-entry buffer with w write and w read ports. The proposed fill unit is more complex: for the fill-up process, w write ports are required in the buffer, indexed by the fillhead pointer; for the emptying process, w read ports are used to extract w instructions at the location pointed to by emptyhead. Moreover, 2w additional write ports are required to update the presence of at most two input dependences per instruction. In total, the proposed fill unit is implemented with w read + 3w write ports. Each fill unit entry is c + 56 bits wide (32 uop bits + three 8-bit sequence numbers + a c-entry bitmap), where c is the number of clusters.
Detailed Power Consumption
Table 4 shows detailed power consumption values for a specific processor configuration with four clusters and a bus-4 network topology. For both the baseline and the proposed machines, the columns include leakage power, dynamic power, and the total energy dissipated on average for the execution of the Mediabench suite. Each row represents a set of hardware structures, including the processor front end (BTB, branch predictor, decoder, uop queue), the memory subsystem (TLB, data cache, instruction cache, L2 cache, LSQ), traces support (trace cache, fill unit), register renaming structures (front-end RAT, retirement RAT, free list), intercluster network, register files, instruction queues, and functional units (including integer and floating-point ALUs, and result broadcast buses).
As observed, the leakage power column is identical for the baseline and proposed machines, except for the hardware devoted to subtraces generation. The joint leakage power consumption of the trace cache and fill unit (traces support) increases by 10.1 percent in the proposed scheme. The dynamic power increases for all processor components due to higher transistor switching activity, except for the intercluster network, whose activity drops because fewer copy instructions are issued. In total, the dynamic power consumption increases by 6.5 percent. However, this increase is compensated by a reduction of the execution time, which leads to a global energy saving of 6.7 percent (bottom-right cell in the table).
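The trade-off between higher power and shorter execution time can be made explicit with a short calculation. The sketch below assumes that energy is average total power multiplied by execution time, and that total power is the sum of leakage and dynamic power; the function names and the equal baseline split between leakage and dynamic power are hypothetical and are not values from Table 4.

```python
def total_power_ratio(leak_base: float, dyn_base: float,
                      leak_increase: float, dyn_increase: float) -> float:
    """Proposed/baseline total power; increases are fractions (0.065 = 6.5 percent)."""
    proposed = leak_base * (1.0 + leak_increase) + dyn_base * (1.0 + dyn_increase)
    return proposed / (leak_base + dyn_base)


def energy_saving(power_ratio: float, speedup: float) -> float:
    """Fractional energy saving, assuming energy = average power * execution time."""
    return 1.0 - power_ratio / speedup


# Hypothetical baseline split: equal leakage and dynamic power (arbitrary units),
# leakage unchanged, dynamic power 6.5 percent higher.
ratio = total_power_ratio(leak_base=1.0, dyn_base=1.0,
                          leak_increase=0.0, dyn_increase=0.065)

# Energy breaks even exactly when the speedup equals the total power ratio.
break_even = energy_saving(ratio, speedup=ratio)
```

Under these assumptions, the proposed scheme saves energy whenever its speedup exceeds the ratio of total power between the two machines. Because leakage is almost unchanged, that ratio is likely smaller than the 6.5 percent dynamic power increase alone would suggest.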
Energy Savings
The behavior exhibited by the specific architecture presented in Table 4 is representative of all evaluated designs. There is a tight relationship between the speedups shown in Fig. 7 and the energy savings. Although some components suffer higher leakage consumption and dynamic energy per access in a processor with automatic subtraces generation, the execution time is reduced enough that the total energy spent in the processor core decreases (by up to 12.2 percent in some configurations). Some cases with small speedups show an energy saving close to 0 percent (n-Buses-2 network), but no simulation shows a higher energy dissipation for the proposed scheme than for its corresponding baseline processor, which confirms the proposed architecture as an energy-efficient approach.
Area and Timing
The Cacti tool [11] has been used to measure the area (Fig. 13a) and timing (Fig. 13b) overheads incurred by the subtraces support in the fill unit and trace cache for processors with different numbers of clusters. While the baseline fill unit occupies about 0.005 mm² in a 45 nm technology, the proposed fill unit is about 5.5 times larger. This considerable increase is due to the additional write ports, but it remains a negligible fraction of the total processor area (less than 0.01 percent). The fill unit access time increases by about 40 percent, but it is still low enough to complete in a single cycle for the modeled 3.4 GHz processor. Additionally, there is a slight [...]
Regarding the trace cache, the area and access time increase incurred by the proposed design is caused by the bit mask attached to each uop in every stored trace. In this case, the number of clusters has a noticeable impact on the proposed trace cache area and access time, because a considerable number of uops are affected by the associated bit mask size. The area overhead ranges from 3.1 percent (two clusters) to 12.6 percent (eight clusters), while the access time increase lies between 2.5 percent (two clusters) and 10.4 percent (eight clusters). However, the trace cache access time does not exceed three cycles even in the worst case.
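The storage figures behind these overheads follow from the parameters given in the power study. The sketch below only restates that arithmetic and is not a substitute for the Cacti models; the function names are hypothetical, and the trace cache bit mask is assumed to hold one bit per cluster for every uop, by analogy with the c-entry bitmap of the fill unit.

```python
def fill_unit_entry_bits(clusters: int) -> int:
    """Entry width: 32 uop bits + three 8-bit sequence numbers + c-entry bitmap."""
    return 32 + 3 * 8 + clusters


def fill_unit_ports(commit_width: int) -> tuple[int, int]:
    """(read, write) ports of the proposed fill unit: w read and 3w write."""
    return commit_width, 3 * commit_width


def trace_cache_bitmap_bits(clusters: int, traces: int = 256,
                            uops_per_trace: int = 16) -> int:
    """Extra bits in the trace cache, assuming one bit per cluster per uop."""
    return traces * uops_per_trace * clusters


def cycle_time_ns(freq_ghz: float) -> float:
    """Clock period in nanoseconds for a frequency given in GHz."""
    return 1.0 / freq_ghz


entry_bits = fill_unit_entry_bits(clusters=4)      # c + 56 with c = 4
read_ports, write_ports = fill_unit_ports(commit_width=4)
extra_bits = trace_cache_bitmap_bits(clusters=8)   # worst case studied: eight clusters
period = cycle_time_ns(3.4)                        # budget for the fill unit access
```

Under this assumption the added trace cache storage grows linearly with the number of clusters, which agrees with the trend of the reported area and access time overheads.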
RELATED WORK
The fact that performance in a clustered architecture is very sensitive to the intercluster communication latency has been widely discussed in previous research works [4], [12]. Likewise, a large amount of research has focused on steering algorithms that try to balance the workload among clusters while minimizing the number of copy instructions [13], [14], [12], [7].
In [13], Baniasadi and Moshovos propose several steering heuristics, classified as static and adaptive. They propose a relatively simple method that offers competitive performance by just changing the target cluster every three instructions. The work by Canal et al. [14] focuses on dynamic runtime heuristics on heterogeneous two-cluster processors, where only one of the clusters is able to execute floating-point instructions. In [12], Parcerisa and González show how to reduce both communication and workload imbalance by applying value prediction. Finally, the topology-aware steering presented in [7] and adopted for the present work considers complex networks and a higher number of clusters. However, all these proposals are constrained by the ILP present in the executed sequential code, which imposes a lower bound on the number of generated copy instructions for a given workload balance.
On the other hand, the trace cache was proposed as an effective solution to fulfill high fetch bandwidth requirements in superscalar [6] and clustered processors [15], [16], and was implemented in the Intel Celeron, Pentium 4, Pentium D, and Xeon microprocessor families [17]. The trace cache fill unit is a latency-tolerant component [18], [19] that offers a good opportunity for new optimizations. In [15], trace preprocessing is proposed in the fill unit and later used by the steering logic to dispatch dependent instructions of the same trace to the same cluster. In [16], this analysis is improved by spotting intertrace dependences in order to place dependent instructions from different traces in the same cluster. Unlike this work, none of these optimizations considers replicating instructions to further overcome communication delays.
The idea of introducing artificial parallelism through replication has also been employed by Madriles et al. [20]. In this case, additional TLP (thread-level parallelism) is induced by the compiler, which replicates basic blocks to create speculative threads. These threads are created by first representing data- and control-dependent basic blocks with a directed graph, and then applying the multilevel graph partitioning algorithm to it [21]. An architectural design is also proposed to undo the execution of mispredicted threads.
Compared to previous works, our proposal provides three main advantages: 1) it provides binary compatibility with existing sequential code, since it neither involves the compiler nor modifies the ISA, 2) the subtraces generation algorithm is lightweight enough to be implemented in hardware without incurring extra latency on the critical path, and 3) there is no need for an additional recovery mechanism, since mispredicted subtraces do not require special handling. In summary, to the best of our knowledge, our proposal is the first fully hardware-based, binary-compatible approach that generates parallel subtraces out of sequential code at the instruction level.
CONCLUSIONS
This paper has presented an energy-efficient hardware mechanism that automatically detects independent subtraces of instructions in a sequential program. An implementation of the mechanism on top of a clustered microarchitecture has been devised, using a trace cache with a modified fill unit. By carefully replicating the execution of specific instructions, intercluster communication is reduced, network traffic and collisions decrease, and overall performance improves. When designing a clustered architecture that boosts single-thread performance at lower hardware cost, subtraces generation is shown to provide additional benefits for different numbers of clusters and interconnect topologies.
Experimental results show average performance speedups of about 3, 7, and 15 percent (accompanied by 2, 6, and 10 percent average energy savings) for the bus-based interconnects with two, four, and eight transmission cycles, respectively. The evaluated point-to-point networks provide average performance speedups of about 3 and 7 percent (with 2 and 5 percent average energy savings) for two- and four-cycle link delays, respectively. While higher levels of performance can be achieved by running multithreaded applications on multicore processors or data-parallel programs on GPUs, such applications must first be recoded following less intuitive parallel programming models. Our proposal especially benefits highly parallel applications implemented under a sequential programming model, whose intrinsic parallelism was not completely exploited at compile time.
As future work, we plan to extend the proposed technique to a multicore environment, where a reduction of intercore communication may benefit the memory hierarchy and coherence actions. In this environment, an alternative hardware implementation should be proposed, which does not rely on a shared processor front end with a common trace cache.
Rafael Ubal received the PhD degree in computer engineering in 2010 from the Universidad Politécnica de Valencia, Spain. He is currently a lecturer in the Electrical and Computer Engineering Department at Northeastern University, Boston, Massachusetts, and postdoctoral associate researcher in the NUCAR group conducted by Professor David Kaeli. His research topics of interest include power-aware cache designs, automatic parallelization of code, clustered/multithreaded/multicore architectures, and GPGPU. He is the main developer of the Multi2Sim simulation framework, a CPU-GPU simulation platform for heterogeneous computing.
Julio Sahuquillo received the BS, MS, and PhD degrees in computer engineering from the Universidad Politécnica de Valencia, Spain. Since 2002 he has been an associate professor at the Department of Computer Engineering. He has taught several courses on computer organization and architecture. He has published more than 70 refereed conference and journal papers. His research topics include multiprocessor systems, cache design, instruction-level parallelism, and power dissipation. An important part of his research has also concentrated on the web performance field, including proxy caching, web prefetching, and web workload characterization. He is a member of the IEEE Computer Society.
