This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them.
Introduction
This paper examines simultaneous multithreading (SM), a technique that permits several independent threads to issue to multiple functional units each cycle. In the most general case, the binding between thread and functional unit is completely dynamic. The objective of SM is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread. Simultaneous multithreading combines the multiple-issue-per-instruction features of modern superscalar processors with the latency-hiding ability of multithreaded architectures. It also inherits numerous design challenges from these architectures, e.g., achieving high register file bandwidth, supporting high memory access demands, meeting large forwarding requirements, and scheduling instructions onto functional units. In this paper, we (1) introduce several SM models, most of which limit key aspects of the complexity of such a machine, (2) evaluate the performance of those models relative to superscalar and fine-grain multithreading, [24], Sun UltraSparc [25], and HP PA-8000 [26] issue up to four instructions per cycle from a single thread. Multiple instruction issue has the potential to increase performance, but is ultimately limited by instruction dependencies (i.e., the available parallelism) and long-latency operations within the single executing thread. The effects of these are shown as horizontal waste and vertical waste in Figure 1.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. ISCA '95, Santa Margherita Ligure Italy © 1995 ACM 0-89791-698-0/95/0006 ...$3.50
Multithreaded architectures, on the other hand, such as HEP [28], Tera [3], MASA [15], and Alewife [2], employ multiple threads with fast context switch between threads. Traditional multithreading hides memory and functional unit latencies, attacking vertical waste. In any one cycle, though, these architectures issue instructions from only one thread. The technique is thus limited by the amount of parallelism that can be found in a single thread in a single cycle. And as issue width increases, the ability of traditional multithreading to utilize processor resources will decrease. Simultaneous multithreading, in contrast, attacks both horizontal and vertical waste.
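The distinction between the two kinds of waste can be made concrete with a minimal sketch. It assumes the usual reading of the terms: vertical waste is every issue slot in a cycle where nothing issues, and horizontal waste is the unused slots in a cycle where at least one instruction issues. The function name `classify_waste` and the sample trace are hypothetical and are not taken from the simulator described in this paper.

```python
def classify_waste(issued_per_cycle, issue_width):
    """Split unused issue slots into vertical and horizontal waste.

    issued_per_cycle: number of instructions issued in each cycle.
    issue_width: maximum instructions that can issue per cycle.
    """
    vertical = 0    # slots lost in cycles where nothing issued
    horizontal = 0  # unused slots in cycles where something issued
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issue count outside 0..issue_width")
        if issued == 0:
            vertical += issue_width
        else:
            horizontal += issue_width - issued
    total = issue_width * len(issued_per_cycle)
    used = total - vertical - horizontal
    return {"used": used, "vertical": vertical,
            "horizontal": horizontal, "total": total}


# Hypothetical 5-cycle trace on a 4-issue machine.
trace = [2, 0, 1, 0, 3]
result = classify_waste(trace, issue_width=4)
print(result)
```

Under these assumptions, a fine-grain multithreaded machine can only convert the `vertical` count into useful work, because it still issues from one thread per cycle; filling the `horizontal` count requires issuing from several threads in the same cycle.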
This study evaluates the potential improvement, relative to wide superscalar architectures and conventional multithreaded architectures, of various simultaneous multithreading models. To place our evaluation in the context of modern superscalar processors, we simulate a base architecture derived from the 300 MHz Alpha 21164 [11], enhanced for wider superscalar execution; our SM architectures are extensions of that basic design. Since code scheduling is crucial on wide superscalars, we generate code using the state-of-the-art Multiflow trace scheduling compiler [20].
Our results show the limits of superscalar execution and traditional multithreading to increase instruction throughput in future processors. For example, we show that (1) even an 8-issue superscalar architecture fails to sustain 1.5 instructions per cycle, and (2) a fine-grain multithreaded processor (capable of switching contexts every cycle at no cost) utilizes only about 40% of a wide superscalar, regardless of the number of threads. Simultaneous multithreading, on the other hand, provides significant performance improvements in instruction throughput, and is only limited by the issue bandwidth of the processor.
A more traditional means of achieving parallelism is the conventional multiprocessor. As chip densities increase, single-chip multiprocessors will become a viable design option [7]. The simultaneous multithreaded processor and the single-chip multiprocessor are two close organizational alternatives for increasing on-chip execution resources. We compare these two approaches and show that simultaneous multithreading is potentially superior to multiprocessing in its ability to utilize processor resources. For example, a single simultaneous multithreaded processor with 10 functional units outperforms by 24% a conventional 8-processor multiprocessor with a total of 32 functional units, when they have equal issue bandwidth.
For this study we have speculated on the pipeline structure for a simultaneous multithreaded processor, since an implementation does not yet exist. Our architecture may therefore be optimistic in two respects: first, in the number of pipeline stages required for instruction issue; second, in the data cache access time (or load delay cycles) for a shared cache, which affects our comparisons with single-chip multiprocessors.
The likely magnitude of these effects is discussed in Sections 2.1 and 6, respectively. Our results thus serve, at the least, as an upper bound to simultaneous multithreading performance, given the other constraints of our architecture.
Real implementations may see reduced performance due to various design tradeoffs; we intend to explore these implementation issues in future work.
Previous studies have examined architectures that exhibit simultaneous multithreading through various combinations of VLIW, superscalar, and multithreading features, both analytically [34] and through simulation [16, 17, 6, 23]; we discuss these in detail in Section 7. Our work differs from and extends that work in multiple respects: (1) the methodology, including the accuracy and detail of our simulations, the base architecture we use for comparison, the workload, and the wide-issue compiler optimization and scheduling technology; (2) the variety of SM models we simulate; (3) our analysis of cache interactions with simultaneous multithreading; and finally, (4) our comparison and evaluation of multiprocessing and simultaneous multithreading.
This paper is organized as follows. Section 2 defines in detail our basic machine model, the workloads that we measure, and the simulation environment that we constructed. Section 3 evaluates the performance of a single-threaded superscalar architecture; it provides motivation for the simultaneous multithreaded approach. Section 4 presents the performance of a range of SM architectures and compares them to the superscalar architecture, as well as a fine-grain multithreaded processor. Section 5 explores the effect of cache design alternatives on the performance of simultaneous multithreading. Section 6 compares the SM approach with conventional multiprocessor architectures. We discuss related work in Section 7, and summarize our results in Section 8.
Methodology
Our goal is to evaluate several architectural alternatives as defined in the previous section: wide superscalars, traditional multithreaded processors, simultaneous multithreaded processors, and small-scale multiple-issue multiprocessors.
To do this, we have developed a simulation environment that defines an implementation of a simultaneous multithreaded architecture; that architecture is a straightforward extension of next-generation wide superscalar processors, running a real multiprogrammed workload that is highly optimized for execution on our target machine.
Simulation Environment
Our simulator uses emulation-based instruction-level simulation, similar to Tango [8] and g88 [4]. Like g88, it features caching of partially decoded instructions for fast emulated execution.
Our simulator models the execution pipelines, the memory hierarchy (both in terms of hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor. It is based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution. The model deviates from the Alpha in some respects to support increased single-stream parallelism, such as more flexible instruction issue, improved branch prediction, and larger, higher-bandwidth caches.
The typical simulated configuration contains 10 functional units of four types (four integer, two floating point, three load/store, and one branch) and a maximum issue rate of 8 instructions per cycle. We assume that all functional units are completely pipelined. Table 1 shows the instruction latencies used in the simulations, which are derived from the Alpha 21164. We assume first- and second-level on-chip caches considerably larger than on the Alpha, for two reasons. First, multithreading puts a larger strain on the cache subsystem, and second, we expect larger on-chip caches to be common in the same time-frame that simultaneous multithreading becomes viable. We also ran simulations with caches closer to current processors; we discuss these experiments as appropriate, but do not show any results. The caches ( identical for most of our execution models; only the relative speeds of the different threads change. The results from the priority scheme present us with some analytical advantages, as will be seen in Section 4, and the performance of the fair scheme can be extrapolated from the priority scheme results.
We do not assume any changes to the basic pipeline to accommodate simultaneous multithreading. The Alpha devotes a full pipeline stage to arrange instructions for issue and another to issue. If simultaneous multithreading requires more than two pipeline stages for instruction scheduling, the primary effect would be an increase in the misprediction penalty. We have run experiments that show that a one-cycle increase in the misprediction penalty would have less than a 1% impact on instruction throughput in single-threaded mode. With 8 threads, where throughput is more tolerant of misprediction delays, the impact was less than 0.5%.
Workload
Our workload is the SPEC92 benchmark suite [10]. To gauge the raw instruction throughput achievable by multithreaded superscalar processors, we chose uniprocessor applications, assigning a distinct program to each thread. This models a parallel workload achieved by multiprogramming rather than parallel processing. In this way, throughput results are not affected by synchronization delays, inefficient parallelization, etc., effects that would make it more difficult to see the performance impact of simultaneous multithreading alone.
By maximizing single-thread parallelism through our compilation system, we avoid overstating the increases in parallelism achieved with simultaneous multithreading.
Superscalar Bottlenecks: Where Have All the Cycles Gone?
This section provides motivation for simultaneous multithreading by exposing the limits of wide superscalar execution, identifying the sources of those limitations, and bounding the potential improvement possible from specific latency-hiding techniques.
Using the base single-hardware-context machine, we measured the issue utilization, i.e., the percentage of issue slots that are filled each cycle, for most of the SPEC benchmarks. We also recorded the cause of each empty issue slot. For example, if the next instruction cannot be scheduled in the same cycle as the current instruction, then the remaining issue slots this cycle, as well as all issue slots for idle cycles between the execution of the current instruction and the next (delayed) instruction, are assigned to the cause of the delay. such as an I tlb miss and an I cache miss, the wasted cycles are divided up appropriately. Table 3 specifies all possible sources of wasted cycles in our model, and some of the latency-hiding or latency-reducing techniques that might apply to them. Previous work [32, 5, 18], in contrast, quantified some of these same effects by removing barriers to parallelism and measuring the resulting increases in performance.
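The attribution rule in the example above (remaining slots in the current cycle plus all slots in the intervening idle cycles go to the cause of the delay) can be sketched as follows. This is a simplified illustration for a single thread, not the accounting code of our simulator: the function name `attribute_empty_slots`, the input format, and the cause labels are hypothetical, and the sketch does not handle overlapping or additive causes.

```python
from collections import Counter


def attribute_empty_slots(issues, issue_width):
    """Charge empty issue slots to the cause that delayed the next instruction.

    issues: list of (cycle, cause) pairs in program order, with cycles
        nondecreasing. `cause` names why the instruction could not issue
        in the same cycle as its predecessor; it is ignored when the
        instruction does issue in that same cycle.
    Returns a Counter mapping cause -> number of wasted issue slots.
    """
    wasted = Counter()
    prev_cycle = None
    issued_in_prev = 0
    for cycle, cause in issues:
        if prev_cycle is None:
            # First instruction: nothing before it to account for.
            prev_cycle, issued_in_prev = cycle, 1
        elif cycle == prev_cycle:
            # Issued alongside the previous instruction: no waste yet.
            issued_in_prev += 1
        else:
            # Unused slots left in the previous cycle ...
            leftover = issue_width - issued_in_prev
            # ... plus every slot in the fully idle cycles in between.
            idle = issue_width * (cycle - prev_cycle - 1)
            wasted[cause] += leftover + idle
            prev_cycle, issued_in_prev = cycle, 1
    return wasted


# Hypothetical trace on a 4-issue machine.
issues = [(0, None), (0, None),
          (3, "dcache miss"),
          (4, "branch misprediction")]
wasted = attribute_empty_slots(issues, issue_width=4)
print(dict(wasted))
```

Slots left unused in the final cycle are not charged to anything here, because there is no following instruction whose delay could explain them.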
Our results, shown in Figure 2, demonstrate that the functional units of our wide superscalar processor are highly underutilized. From the composite results bar on the far right, we see a utilization of only 19% (the "processor busy" component of the composite bar of Figure 2), which represents an average execution of less than 1.5 instructions per cycle on our 8-issue machine.
Simultaneous Multithreading
This section presents performance results for simultaneous multithreaded processors. We begin by defining several machine models for simultaneous multithreading, spanning a range of hardware complexities. We then show that simultaneous multithreading provides significant performance improvement over both single-thread superscalar and fine-grain multithreaded processors, both in the limit, and also under less ambitious hardware assumptions.
The Machine Models
The following models reflect several possible design choices for a combined multithreaded, superscalar processor.
The fine-grain multithreaded architecture (Figure 3(a)) provides a maximum speedup (increase in instruction throughput) of only 2.1 over single-thread execution (from 1.5 IPC to 3.2). The graph shows that there is little advantage to adding more than four threads in this model. In fact, with four threads, the vertical waste has been reduced to less than 3%, which bounds any further gains beyond that point. This result is similar to previous studies [2, 1, 19, 14, 33, 31] for both coarse-grain and fine-grain multithreading on single-issue processors, which have concluded that multithreading is only beneficial for 2 to 5 threads. These limitations do not apply to simultaneous multithreading, however, because of its ability to exploit horizontal waste. Figures 3(b,c,d) show the advantage of the simultaneous multithreading models, which achieve maximum speedups over single-thread superscalar execution ranging from 3.2 to 4.2, with an issue rate as high as 6.3 IPC. The speedups are calculated using the full simultaneous issue, 1-thread result to represent the single-thread superscalar.
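The speedup figures quoted above follow from simple ratios against the single-thread baseline. The short sketch below reproduces that arithmetic from the rounded IPC values given in the text; because the inputs are rounded, the results agree with the reported speedups only to about one decimal place. The helper names are illustrative.

```python
ISSUE_WIDTH = 8          # issue slots per cycle in the simulated machine
BASELINE_IPC = 1.5       # single-thread superscalar (rounded, from the text)


def speedup(ipc, baseline_ipc=BASELINE_IPC):
    """Throughput relative to the single-thread superscalar."""
    return ipc / baseline_ipc


def utilization(ipc, issue_width=ISSUE_WIDTH):
    """Fraction of issue slots filled on average."""
    return ipc / issue_width


fine_grain = speedup(3.2)        # fine-grain multithreading peak
best_sm = speedup(6.3)           # best simultaneous multithreading result
base_util = utilization(1.5)     # single-thread issue utilization
sm_util = utilization(6.3)       # SM issue utilization at 6.3 IPC

print(f"fine-grain speedup ~ {fine_grain:.2f}")
print(f"SM speedup         ~ {best_sm:.2f}")
print(f"utilization        {base_util:.1%} -> {sm_util:.1%}")
```

The same ratios show why fine-grain multithreading saturates early: at 3.2 IPC on an 8-issue machine, well over half of the issue slots remain empty even when almost no cycle is completely idle.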
With SM, it is not necessary for any single thread to be able to utilize the entire resources of the processor in order to get maximum or near-maximum performance.
The four-issue model gets nearly the performance of the full simultaneous issue model, and even the dual-issue model is quite competitive, reaching 94% of full simultaneous issue at 8 threads. The limited connection model approaches full simultaneous issue more slowly due to its less flexible scheduling. Each of these models becomes increasingly competitive with full simultaneous issue as the ratio of threads to issue slots increases.
With the results shown in Figure 3(d), we see the possibility of trading the number of hardware contexts against hardware complexity in other areas. For example, if we wish to execute around four instructions per cycle, we can build a four-issue or full simultaneous machine with 3 to 4 hardware contexts, a dual-issue machine with 4 contexts, a limited connection machine with 5 contexts, or a single-issue machine with 6 contexts. Tera [3] is an extreme example of trading pipeline complexity for more contexts; it has no forwarding in its pipelines and no data caches, but supports 128 hardware contexts.
The increases in processor utilization are a direct result of threads dynamically sharing processor resources that would otherwise remain idle much of the time; however, sharing also has negative effects. We see (in Figure 3(c)) the effect of competition for issue slots and functional units in the full simultaneous issue model, where the lowest priority thread (at 8 threads) runs at 55% of the speed of the highest priority thread. We can also observe the impact of sharing other system resources (caches, TLBs, branch prediction table); with full simultaneous issue, the highest priority thread, which is fairly immune to competition for issue slots and functional units, degrades significantly as more threads are added (a 35% slowdown at 8 threads). Competition for non-execution resources, then, plays nearly as significant a role in this performance region as the competition for execution resources.
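To illustrate how a strict priority order and a per-thread issue limit interact, the following sketch hands out issue slots for one cycle. It is a deliberately simplified reading of the models: it assumes threads are served in fixed priority order, treats the single-, dual-, and four-issue models as a cap of 1, 2, or 4 instructions per thread per cycle, and ignores functional unit types, dependences, and the limited connection model. The function `allocate_issue_slots` and the sample ready counts are hypothetical.

```python
def allocate_issue_slots(ready, issue_width, per_thread_cap=None):
    """Grant issue slots for one cycle in strict priority order.

    ready: ready-instruction count per thread; index 0 has highest priority.
    issue_width: total issue slots available this cycle.
    per_thread_cap: maximum issues per thread per cycle, or None for
        no per-thread limit (full simultaneous issue in this sketch).
    Returns the number of instructions issued by each thread.
    """
    remaining = issue_width
    granted = []
    for count in ready:
        limit = count if per_thread_cap is None else min(count, per_thread_cap)
        take = min(limit, remaining)
        granted.append(take)
        remaining -= take
    return granted


# Hypothetical cycle: four threads with 3, 2, 4, and 1 ready instructions.
ready = [3, 2, 4, 1]
full = allocate_issue_slots(ready, issue_width=8)
dual = allocate_issue_slots(ready, issue_width=8, per_thread_cap=2)
single = allocate_issue_slots(ready, issue_width=8, per_thread_cap=1)
print(full, dual, single)
```

In this toy cycle the uncapped policy fills all eight slots but starves the lowest priority thread, while the capped policies spread slots more evenly at the cost of leaving some unused; with more threads than a capped policy needs to fill the machine, that gap narrows.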
Others have observed that caches are more strained by a multithreaded workload than a single-thread workload, due to a decrease in locality [21, 33, 1, 31]. Our data (not shown) pinpoints the ex-
[Figure panel: SM:Full Simultaneous Issue]
There are two configurations that appear to be good choices.
Because there is little performance difference at 8 threads, the cost of optimizing for a small number of threads is small, making 64s.64s an attractive option. However, if we expect to typically operate with all or most thread slots full, the 64p.64s gives the best performance in that region and is never worse than the second best performer with fewer threads. The shared data cache in this scheme allows it to take advantage of more flexible cache partitioning, while the private instruction caches make each thread less sensitive to the presence of other threads. Shared data caches also have a significant advantage in a data-sharing environment by allowing sharing at the lowest level of the data cache hierarchy without any special hardware for cache coherence.
For these experiments, we tried to choose SM and MP configurations that are reasonably equivalent, although in several cases we biased in favor of the MP. For most of the comparisons we keep all or most of the following equal: the number of register sets (i.e., the number of threads for SM and the number of processors for MP), the total issue bandwidth, and the specific functional unit configuration. A consequence of the last item is that the functional unit configuration is often optimized for the multiprocessor and represents an inefficient configuration for simultaneous multithreading. All experiments use 8 KB private instruction and data caches (per thread for SM, per processor for MP), a 256 KB 4-way set-associative shared second-level cache, and a 2 MB direct-mapped third-level cache. We want to keep the caches constant in our comparisons, and this (private I and D caches) is the most natural configuration for the multiprocessor.
We evaluate MPs with 1, 2, and 4 issues per cycle on each processor. We evaluate SM processors with 4 and 8 issues per cycle; however, we use the SM:Four Issue model (defined in Section 4.1) for all of our SM measurements (i.e., each thread is limited to four issues per cycle). Using this model minimizes some of the inherent complexity differences between the SM and MP architectures. For example, an SM:Four Issue processor is similar to a single-threaded processor with 4 issues per cycle in terms of both the number of ports on each register file and the amount of inter-instruction dependence checking. In each experiment we run the same version of the benchmarks for both configurations (compiled for a 4-issue, 4 functional unit processor, which most closely matches the MP configuration) on both the MP and SM models; this typically favors the MP.
We must note that, while in general we have tried to bias the tests in favor of the MP, the SM results may be optimistic in two respects: the amount of time required to schedule instructions onto functional units, and the shared cache access time. The impact of the former, discussed in Section 2.1, is small. The distance between the load/store units and the data cache can have a large impact on cache access time.
Related Work
We have built on work from a large number of sources in this paper. In this section, we note previous work on instruction-level parallelism, on several traditional (coarse-grain and fine-grain) multithreaded architectures, and on two architectures (the M-Machine and the Multiscalar architecture) that have multiple contexts active simultaneously, but do not have simultaneous multithreading. We also discuss previous studies of architectures that exhibit simultaneous multithreading and contrast our work with these in particular.
The data presented in Section 3 provides a different perspective from previous studies on ILP, which remove barriers to parallelism (i.e., apply real or ideal latency-hiding techniques) and measure the resulting performance. Smith, et al., [28] focus on the effects of fetch, decoding, dependence-checking, and branch prediction limitations on ILP; Butler, et al., [5] examine these limitations plus scheduling window size, scheduling policy, and functional unit configuration; Lam and Wilson [18] focus on the interaction of branches and ILP; and Wall [32] examines scheduling window size, branch prediction, register renaming, and aliasing.
Previous work on coarse-grain [2, 27, 31] and fine-grain [28, 3, 15, 22, 19] fine-grain threads to processors, so competition for execution resources (processors in this case) is at the level of a task rather than an individual instruction.
Hirata, et al., [16] present an architecture for a multithreaded superscalar processor and simulate its performance on a parallel ray-tracing application. They do not simulate caches or TLBs, and their architecture has no branch prediction mechanism. They show speedups as high as 5.8 over a single-threaded architecture when using 8 threads. Yamamoto, et al., [34] present an analytical model of multithreaded superscalar performance, backed up by simulation. Their study models perfect branching, perfect caches and a homogeneous workload (all threads running the same trace). They report increases in instruction throughput of 1.3 to 3 with four threads.
Keckler and Dally [17] and Prasadh and Wu [23] describe architectures that dynamically interleave operations from VLIW instructions onto individual functional units. Keckler and Dally report speedups as high as 3.12 for some highly parallel applications. Prasadh and Wu also examine the register file bandwidth requirements for 4 threads scheduled in this manner. They use infinite caches and show a maximum speedup above 3 over single-thread execution for parallel applications. Daddis and Torng [6] plot increases in instruction throughput as a function of the fetch bandwidth and the size of the dispatch stack. The dispatch stack is the global instruction window that issues all fetched instructions. Their system has two threads, unlimited functional units, and unlimited issue bandwidth (but limited fetch bandwidth). They report a near doubling of throughput.
In contrast to these studies of multithreaded, superscalar architectures, we use a heterogeneous, multiprogrammed workload based on the SPEC benchmarks; we model all sources of latency (cache, memory, TLB, branching, real instruction latencies) in detail. We also extend the previous work in evaluating a variety of models of SM execution. We look more closely at the reasons for the resulting performance and address the shared cache issue specifically. We go beyond comparisons with single-thread processors and compare simultaneous multithreading with other relevant architectures: fine-grain, superscalar multithreaded architectures and single-chip multiprocessors.
Our results show the benefits of simultaneous multithreading when compared to the other architectures, namely:
1. Given our model, a simultaneous multithreaded architecture, properly configured, can achieve 4 times the instruction throughput of a single-threaded wide superscalar with the same issue width (8 instructions per cycle, in our experiments).
2. While fine-grain multithreading (i.e., switching to a new thread every cycle) helps close the gap, the simultaneous multithreading architecture still outperforms fine-grain multithreading by a factor of 2. This is due to the inability of fine-grain multithreading to utilize issue slots lost due to horizontal waste.
3. A simultaneous multithreaded architecture is superior in performance to a multiple-issue multiprocessor, given the same total number of register sets and functional units. Moreover, achieving a specific performance goal requires fewer hardware execution resources with simultaneous multithreading.
The advantage of simultaneous multithreading, compared to the other approaches, is its ability to boost utilization by dynamically scheduling functional units among multiple threads. SM also increases hardware design flexibility; a simultaneous multithreaded architecture can trade off functional units, register sets, and issue bandwidth to achieve better performance, and can add resources in a fine-grained manner.
Simultaneous multithreading increases the complexity of instruction scheduling relative to superscalars, and causes shared resource contention, particularly in the memory subsystem. However, we have shown how simplified models of simultaneous multithreading reach nearly the performance of the most general SM model with complexity in key areas commensurate with that of current superscalars; we also show how properly tuning the cache organization can both increase performance and make individual threads less sensitive to multi-thread contention.
