Abstract: As the performance gap between processor and memory grows, memory latency becomes a major bottleneck in achieving high processor utilization. Multithreading has emerged as one of the most promising techniques for tolerating memory latency by exploiting thread-level parallelism. The question, however, remains as to how effective multithreading is at tolerating memory latency. The performance of multithreading is affected not only by the overlapping of memory latency with useful computation, but also by the cache behavior and the overhead of multithreading (e.g., thread management and context-switch costs). In particular, multithreading affects the behavior of caches and, thus, the overall performance in a nontrivial fashion. To study these issues, this paper presents the Multithreaded Virtual Processor (MVP) model. MVP integrates the multithreaded programming paradigm and a modern superscalar processor with support for fast context switching and thread scheduling. Our studies with MVP show that, in general, the performance improvements are obtained not only by tolerating memory latency but also by lowering cache miss rates through the exploitation of data locality. However, multithreading places additional stress on the memory hierarchy because of the interference among threads. Also, the dynamic behavior of multithreaded execution hinders instruction locality, which results in a high number of misses in the L1 instruction cache.

INTRODUCTION
In response to the ever-increasing demand for higher performance in computing, modern superscalar processors exploit instruction-level parallelism (ILP) by issuing multiple instructions in a single cycle. To exploit ILP, superscalar processors employ techniques such as out-of-order execution, speculative execution, and multilevel cache systems. However, stalls due to cache misses severely degrade the performance by disrupting the exploitation of ILP. Recent technological trends show that the speed of commercial microprocessors has increased by a factor of 12 over the past 10 years, while the speed of memories has only doubled [8]. Thus, the memory latency in terms of processor clock cycles has grown by a factor of six in 10 years. More importantly, the performance gap between processors and main memory will no doubt continue to increase in the future. Multiprocessors and multicomputers also greatly exacerbate the memory latency problem. In SMPs, contention on the shared bus located between the processor's L2 cache and the shared main memory subsystem adds further delay to the memory latency.
The memory latency problem becomes even more severe for scalable distributed shared memory (DSM) systems because a miss on the local memory requires a request to be issued to the remote memory and a reply to be sent back to the requesting processor. This limits the performance and scalability since the proportion of the processor time actually spent on useful work keeps diminishing as parallel machines become larger.
There are a number of techniques that effectively reduce the memory latency, such as prefetching, compiler optimizations, and multilevel caches, but they do not provide a complete solution. Therefore, the remaining latency must be tolerated. Multithreading has emerged as one of the most promising techniques to tolerate memory latency. A multithreaded system contains multiple "loci of control" (or threads) within a single program and provides the processor with the ability to switch between the threads, hiding not only memory latency but also other long-latency operations, such as I/O and synchronization operations. The processor may also interleave instructions from multiple threads on a cycle-by-cycle basis to minimize pipeline breaks due to dependencies among instructions within a single thread, leading to higher processor utilization.
To support the exploitation of thread-level parallelism (TLP), two types of architectures have been proposed: multiprocessor and multithreaded systems. Multiprocessors replicate a number of superscalar processors and provide interprocessor communication mechanism via shared-memory. Threads are statically partitioned and executed on a separate processor. Therefore, it is difficult for a multiprocessor to dynamically exploit TLP among the processors. On the other hand, multithreaded systems provide support for multiple contexts and fast context switching within the processor pipeline. This allows multiple threads to share the processor's resources by dynamically switching between threads. Therefore, multithreaded systems tend to achieve better processor utilization and, thus, improve the performance.
In light of the aforementioned discussion, this paper presents the development of the Multithreaded Virtual Processor (MVP), which exploits the synergy between the multithreaded programming paradigm and the well-designed contemporary microprocessors. MVP is a proof of concept that, by providing an adequate hardware support to an existing superscalar core, we can take full advantage of latency tolerance of multithreading. With its fast context switching and hardware scheduling mechanisms, MVP provides the capability to hide cache miss latencies. In order to evaluate the performance of the MVP and its cache effects, we developed a simulator that integrates a general purpose thread package and a multithreaded superscalar simulator. Our experiments showed that the multithreading's latency tolerating capability effectively improves the overall performance. Also, despite the fact that multithreading is known to increase the cache miss rates due to the interference among threads, our studies showed that sharing effect among threads further contributes to the performance improvement of MVP when data locality is exploited.
The paper is organized as follows: Section 2 introduces the related work on the memory latency problem and other multithreaded systems. Section 3 provides a brief description of the MVP. Section 4 describes the MVP simulation environment. Section 5 presents the simulation results along with an analytical model of MVP performance. Finally, Section 6 provides a conclusion and future research directions.
RELATED WORK
There are two techniques for dealing with the memory latency problem: latency reduction and latency tolerance. Caches are one of the most effective latency reduction methods. Therefore, much focus has been given to improving the cache performance. The performance of caches can be improved by reducing the hit time, the miss penalty, and the miss rate. To reduce the hit time, many modern processors incorporate on-chip caches. Multilevel caches and lock-up free caches are often used to reduce the miss penalty [23]. However, most of the research on improving the cache performance has concentrated on reducing the cache miss rate, e.g., the hardware [6] and software [4], [13], [17], [19] prefetching mechanisms, compiler optimizations [7], victim caches [11], assist caches [14], stream buffers [20], etc. There are also proposals that attempt to reduce the miss rate by exploiting temporal/spatial locality in the cache. Johnson et al. introduced a hardware mechanism, called the Spatial Locality Detection Table (SLDT), that detects spatial reuse of the cached data [12]. Milutinovic et al. introduced a cache mechanism that exploits both spatial and temporal locality [18]. Their proposal separates the caches into two parts: a temporal part and a spatial part. The temporal part consists of small caches in a hierarchy, and the spatial part has no hierarchy but may use a prefetching mechanism to the main memory. At compile time, the data are categorized as exhibiting either predominantly temporal locality or predominantly spatial locality, and a hardware mechanism reallocates the data between the two parts for better locality at run time. All the techniques described thus far focus on reducing the memory latency, and the remaining latency can be tolerated using multithreading.
The idea of multithreading is not new. Fine-grained multithreading was implicit in the dataflow model of computation. Multiple hardware contexts (i.e., register files and PSWs) to speed up switching between threads were implemented in systems such as HEP [22] and Tera [2]. However, these systems require considerable modification to the underlying architecture. There has also been an effort to integrate multithreading support into an existing processor: MIT's Alewife machine uses a modified SPARC processor, called Sparcle [1]. However, Sparcle is based on an outdated single-issue processor architecture; therefore, it is unclear what effect multithreading will have on modern superscalar architectures. Multithreading has also been extensively used as a programming paradigm (i.e., software-controlled multithreading) on general purpose hardware to increase an application's throughput and to exploit thread parallelism on SMPs. However, software-controlled multithreading systems, such as Pthreads [3] and Solaris threads [5], lack the hardware features necessary to detect and handle cache misses and, therefore, the ability to hide memory latency.
Recently, a number of multithreaded systems based on modern superscalar architecture have been proposed [9], [10], [16]. For example, Eggers et al. proposed Simultaneous Multithreading (SMT) based on the MIPS R10000 with eight hardware contexts and a large number of functional units [16], [24]. The philosophy behind these architectures is to improve the processor utilization by allowing implicit context switches to occur in response to data/resource availability, reducing pipeline stalls. However, these systems do not explicitly initiate context switching on cache misses to hide long memory latency. Therefore, in the next section, we present MVP, which performs context switching on cache misses.
MULTITHREADED VIRTUAL PROCESSOR
MVP augments the modern superscalar core with hardware and software support for multithreading. MVP extends the software-controlled multithreading model with multiple hardware contexts to provide latency tolerance while providing a transparent view to the programmer. Fig. 1 shows the two layers of the MVP architecture: the software layer and the hardware layer. The software layer provides the facilities for coding multithreaded applications using Pthreads [3]. Pthreads is a POSIX-compliant thread extension that specifies a priority-driven thread model with preemptive scheduling policies. With this extension, threads are constructed from user-defined functions, and a context switch occurs during a long-latency operation such as I/O or synchronization. Therefore, MVP is a coarse-grain multithreaded system with software support for creating, synchronizing, and scheduling threads.
The hardware layer in MVP consists of a conventional superscalar processor core augmented with the Hardware Scheduler and Multiple Hardware Contexts. Each hardware context includes a valid bit (V), a ready bit (R), a thread identifier (TID), a program counter (PC), and a Register File. The responsibility of the Hardware Scheduler is to basically maintain the control of thread states that have been scheduled onto MVP and provide fast context switching. When an L2 cache miss is detected by the Memory Management Unit (MMU), MVP initiates a hardware context-switch. After the context-switch is performed, new instructions are fetched from the memory location indicated by PC of the new hardware context. A more detailed description of the interaction between the software and hardware layers in MVP is given in [15] .
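The per-context state and the switch-on-miss behavior described above can be illustrated with a small software sketch. This is not the MVP hardware design: the round-robin selection policy, the register file size `NUM_REGISTERS`, and the method names `load`, `on_l2_miss`, and `on_miss_serviced` are all assumptions made for illustration. Only the fields (V, R, TID, PC, register file) and the L2-miss trigger come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_REGISTERS = 32  # assumption: the text does not give the register file size


@dataclass
class HardwareContext:
    valid: bool = False   # V bit: the context holds a scheduled thread
    ready: bool = False   # R bit: the thread is not waiting on a cache miss
    tid: int = -1         # thread identifier (TID)
    pc: int = 0           # program counter (PC)
    regs: List[int] = field(default_factory=lambda: [0] * NUM_REGISTERS)


class HardwareScheduler:
    """Toy scheduler; round-robin selection is an assumed policy."""

    def __init__(self, num_contexts: int) -> None:
        self.contexts = [HardwareContext() for _ in range(num_contexts)]
        self.current = 0  # index of the context that is currently running

    def load(self, slot: int, tid: int, pc: int) -> None:
        """Place a thread into a hardware context and mark it runnable."""
        self.contexts[slot] = HardwareContext(valid=True, ready=True, tid=tid, pc=pc)

    def on_l2_miss(self) -> Optional[int]:
        """Mark the running context as waiting and switch to the next ready one.

        Returns the PC to fetch from, or None if no other context is ready
        (in which case the processor would simply stall)."""
        self.contexts[self.current].ready = False
        n = len(self.contexts)
        for step in range(1, n + 1):
            candidate = (self.current + step) % n
            ctx = self.contexts[candidate]
            if ctx.valid and ctx.ready:
                self.current = candidate
                return ctx.pc
        return None

    def on_miss_serviced(self, tid: int) -> None:
        """Set the ready bit again once the miss of thread `tid` is serviced."""
        for ctx in self.contexts:
            if ctx.valid and ctx.tid == tid:
                ctx.ready = True


sched = HardwareScheduler(num_contexts=2)
sched.load(0, tid=10, pc=0x1000)
sched.load(1, tid=11, pc=0x2000)
next_pc = sched.on_l2_miss()  # thread 10 misses, fetch continues from thread 11
```

With two contexts loaded, a miss in the running thread hands the fetch stream to the other context; if that one also misses before the first miss is serviced, `on_l2_miss` returns `None`, which corresponds to the unmasked latency discussed later in the paper.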
SIMULATION ENVIRONMENT
To verify the MVP concept and to study its effect on the cache performance, we developed an execution-driven functional simulator and performed simulation studies to identify the general benefits and drawbacks of multithreading. In particular, our experiments focused on studying the effect of multithreading-related parameters, such as the number of hardware contexts and the granularity of threads, rather than exhaustively studying a variety of cache configurations.
Simulated Architecture
Our simulation studies were based on the following architectural parameters:
- The number of instructions fetched, decoded, and dispatched per cycle is four. The Reservation Stations and the Reorder Buffer (ROB) were each assumed to have 32 entries.
- A two-level cache with a writeback policy and a main memory bus width of 8 bytes were assumed. Latencies of functional units, caches, and main memory were based on Table 1.
- Context switching to a new thread is initiated when an L2 cache miss is detected. Context switching on L1 cache misses was not supported since the L1 miss latency is assumed to be six cycles, which is too short to be worth switching to a new thread. However, a context switch can be initiated at any level of the memory hierarchy as long as sufficient latency exists.
- The process of switching from one hardware context to another involves 1) turning off one register bank and turning on another register bank, 2) flushing the ROB, and 3) fetching from the new context. Assuming this is supported entirely in hardware, this process is very similar to recovering from a mispredicted branch and requires a penalty of three cycles.
- The branch prediction scheme uses a 2K-entry Branch Target Buffer (BTB) with 2-bit branch prediction.
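A rough way to see why a six-cycle L1 miss is not worth a context switch is to compare the miss latency with the switching cost. The sketch below charges one three-cycle switch away from the stalled thread and, as an assumption not stated in the text, a second three-cycle switch back to it. The L2 miss latency used in the example is hypothetical, since Table 1 is not reproduced here.

```python
L1_MISS_LATENCY = 6   # cycles, as stated in the text
SWITCH_PENALTY = 3    # cycles per hardware context switch, as stated in the text
HYPOTHETICAL_L2_MISS_LATENCY = 50  # assumption: placeholder, not a Table 1 value


def net_cycles_hidden(miss_latency: int, switch_penalty: int = SWITCH_PENALTY) -> int:
    """Miss cycles left for useful work after paying for a switch away from
    the stalled thread and (by assumption) a switch back to it."""
    return miss_latency - 2 * switch_penalty


def worth_switching(miss_latency: int, switch_penalty: int = SWITCH_PENALTY) -> bool:
    """A switch pays off only if some latency remains after the switch costs."""
    return net_cycles_hidden(miss_latency, switch_penalty) > 0


l1_decision = worth_switching(L1_MISS_LATENCY)
l2_decision = worth_switching(HYPOTHETICAL_L2_MISS_LATENCY)
```

Under these assumptions the two switches consume the entire L1 miss latency, so nothing is gained, whereas a much longer miss leaves most of its latency available for another thread.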
Benchmarks
Five benchmark programs were used to study the behavior of caches in MVP. The Matrix Multiplication (MMT) and Gaussian Elimination (GE) programs were manually ported to be multithreaded using Pthreads library calls. The other benchmark programs, Fast Fourier Transformation (FFT), MP3D, and Radix Sort (RS), were ported from SPLASH-2 [25]. The SPLASH-2 benchmarks were originally written for shared-memory machines and use ANL macros to create and manage threads. To port the SPLASH-2 benchmarks to the simulator, the ANL macros were replaced with their Pthreads equivalents. All programs were compiled with -O2 optimization, and the original source code was otherwise untouched. Each benchmark is briefly described below:
- MMT partitions the matrix data into blocks and assigns them to threads. The data set for the threads is relatively disjoint, but the row-by-column operation does produce considerable overlapping of data among threads. Moreover, there is no thread intercommunication or synchronization.
- GE partitions an n-by-n matrix among threads by using the row-wise block-cyclic approach. Initially, one thread performs the division step with its pivot value and, then, all other threads perform an elimination step. These two steps are coordinated with barriers. GE threads tend to have very separate and distinct data sets with minimal data sharing besides the pivot value.
- FFT implements a complex 1D version of the $\sqrt{n}$ six-step FFT algorithm. The data set consists of n complex data points and another n complex data points called the roots of unity. Every thread is responsible for transposing a contiguous $(\sqrt{n}/p) \times (\sqrt{n}/p)$ submatrix with every other thread and one submatrix by itself. The data sets between threads are very localized.
- MP3D is a simple simulator for rarefied gas flow over an object in a wind tunnel. The algorithm is primarily occupied with a loop consisting of three phases. Each thread is given particles and proceeds to move them within a defined cell space. The thread continuously detects any possible collisions of its molecules with other molecules and updates the geometry of molecules each time step. The MP3D algorithm contains data that is very localized and shares much of that data among threads. Also, each phase has to be completed by all the threads before continuing to the next phase, requiring a large amount of synchronization.
- RS is a sorting algorithm in which each thread generates a local histogram based on its assigned key values. Then, all the local histograms are combined into a globally shared histogram. After that, each thread iterates over its assigned array and, by using the global histogram, permutes its keys into a new sorted array. The data set is very localized, and the data accesses tend to be very disjoint except when the global histogram is accessed.
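As an illustration of the row-wise block-cyclic partitioning mentioned for GE, the following sketch assigns matrix rows to threads in blocks that are dealt out round-robin. The function name and the `block_size` parameter are hypothetical; the text does not state the block size used in the benchmark.

```python
from typing import List


def block_cyclic_rows(n_rows: int, n_threads: int, block_size: int) -> List[List[int]]:
    """Return, for each thread, the list of row indices it owns.

    Rows are grouped into consecutive blocks of `block_size` rows, and the
    blocks are assigned to threads in round-robin order."""
    owners: List[List[int]] = [[] for _ in range(n_threads)]
    for row in range(n_rows):
        block_index = row // block_size
        owners[block_index % n_threads].append(row)
    return owners


# Example: 8 rows, 2 threads, blocks of 2 rows each.
assignment = block_cyclic_rows(n_rows=8, n_threads=2, block_size=2)
# Thread 0 owns rows 0, 1, 4, 5 and thread 1 owns rows 2, 3, 6, 7.
```

Because every thread keeps rows from all parts of the matrix, the work stays balanced as elimination proceeds, but each thread's rows are distinct from those of the others, which matches the low data sharing noted for GE.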
SIMULATION RESULTS
Two sets of simulation runs were performed for each benchmark described in Section 4.2. The first set was obtained by running serial versions of the benchmarks, and the second set was obtained by running multithreaded versions on MVP. Approximately 500 million to 1.2 billion instructions were simulated, with the number of memory references ranging from 88 million to 300 million. The simulated MVP had two, four, and eight hardware contexts, and the number of threads created for each simulation run was the same as the number of hardware contexts.
Fig. 2 shows the portion of the execution time during which the processor idles due to L2 cache misses for serial execution. MMT experienced the most stalls among the benchmarks, ranging from 20 percent to 55 percent of the total execution time. FFT and GE showed approximately 20 percent and 32 percent of stalls on average, respectively. On the other hand, RS exhibited less than 6 percent of stalls for the various problem sizes. Interestingly, the stall time for MP3D increased proportionally to the number of molecules. In general, it can be seen that, except for RS, the serial execution incurs more than 20 percent of processor stalls during its execution due to L2 cache misses.
Fig. 3 shows the relative performance of MVP. These results were normalized relative to the performance of the serial versions. Fig. 3 shows that, as the problem size increases, MVP begins to overcome its overhead and performs better than the serial cases. An example of this effect is displayed by MP3D. FFT and GE also show a significant performance improvement with increasing data size as they begin to take advantage of the latency tolerance of multithreading. As expected, MMT showed the best performance improvement because the serial execution incurred a large number of stalls and no interthread synchronization was necessary.
Another interesting effect is that the use of more hardware contexts does not necessarily result in improved performance, as seen in both RS and MP3D. This effect is due to the benchmarks' high synchronization requirements and small parallel portions. However, since the performance margin narrows as the problem size increases, we expect that MVP with four and eight hardware contexts would eventually outperform the two hardware contexts case as the problem size becomes large.
Sharing in Cache
To gain a good understanding of how the L1 and L2 caches are affected by multithreading, the cache miss rates for the serial versions and the MVP versions with two, four, and eight contexts were compared. Fig. 4 shows the miss rates for the L1 D-cache and the L2 cache. The miss rates for the L1 I-cache were omitted since the observed miss rates were below 0.5 percent for most benchmarks. The results were rather interesting. The L2 miss rates for MVP are lower than those of the serial versions in most cases. This effect is seen in all the benchmarks except GE and RS. Lower L2 miss rates are due to the fact that the data sets used by these programs have high data locality. This locality is exploited when a cache miss caused by one thread brings in data that other threads may need later. For example, consider the MMT case with two threads multiplying matrices A and B. Assume a cache miss occurs for a column value of B while Thread 1 is multiplying a row of A with a column of B. This cache miss may fill the cache with the column values of B, which may be needed by Thread 2 for the computation of matrix C. Therefore, when Thread 2 looks for its column, the column value may already be in the cache (i.e., a prefetching effect). This data sharing occurs because a cache line is essentially part of a row in the matrix. Also, note that the columns of B themselves are shared among threads by the nature of the MMT algorithm.
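The prefetching effect described above can be reproduced with a toy cache model. The sketch below is not the MVP simulator: it assumes a tiny fully associative LRU cache (four lines of four words each) and hypothetical address ranges, with each thread touching private data (`A1`, `A2`) and the same shared data (`B`). It compares running the two threads back to back with interleaving their accesses.

```python
from collections import OrderedDict
from typing import Iterable, List


def count_misses(trace: Iterable[int], line_words: int = 4, num_lines: int = 4) -> int:
    """Count misses of a fully associative LRU cache on a word-address trace."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    misses = 0
    for addr in trace:
        line = addr // line_words
        if line in cache:
            cache.move_to_end(line)        # hit: mark line as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses


N = 16
A1 = range(100, 100 + N)   # private data of thread 1 (hypothetical addresses)
A2 = range(200, 200 + N)   # private data of thread 2 (hypothetical addresses)
B = range(0, N)            # data shared by both threads


def thread_trace(private: Iterable[int], shared: Iterable[int]) -> List[int]:
    """Each step reads one private word and one shared word."""
    trace: List[int] = []
    for a, b in zip(private, shared):
        trace += [a, b]
    return trace


# Serial order: thread 1 runs to completion, then thread 2.
serial_trace = thread_trace(A1, B) + thread_trace(A2, B)

# Interleaved order: both threads advance together, so a line of B fetched
# for thread 1 is still cached when thread 2 needs it.
interleaved_trace: List[int] = []
for k in range(N):
    interleaved_trace += [A1[k], B[k], A2[k], B[k]]

serial_misses = count_misses(serial_trace)            # 16 misses in 64 accesses
interleaved_misses = count_misses(interleaved_trace)  # 12 misses in 64 accesses
```

In the serial order the shared lines of B are evicted before the second thread reaches them and must be fetched twice; when the accesses are interleaved, each line of B is fetched once and reused. The numbers hold only for this toy configuration.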
Another benchmark that exhibits data sharing is MP3D. In MP3D, the workload is partitioned by molecules, and a molecule is always moved by the thread it belongs to. Also, the partition of molecules changes significantly in each time step. Therefore, access pattern to the space array tends to exhibit low locality for the serial version. However, in the MVP version, it is quite possible for different threads to access the given space at the same time, and this creates the data sharing effect during collision computations. The result is lower cache miss rates for the MVP version.
On the other hand, RS shows increased miss rates for both the L1 and L2 caches in some cases. This effect is caused by the sorting portion of the RS algorithm. When the threads sort their individual keys of the array, the data sets of the threads become disjoint, and the L2 cache miss rate increases. Similarly, GE also has very distinct data sets with low locality among threads. Therefore, a thread, upon generating a cache miss, would simply bring in more of the rows belonging to the same thread. As a result, the threads compete for space within the L2 cache, which leads to a higher L2 miss rate than in the serial version.
Our simulation results thus far indicate the importance of the cache behavior, in particular, the data sharing effects on the overall performance. However, the miss rates alone do not accurately reflect the cache behavior since the total number of cache misses also depends on the number of accesses as well as the miss rate. Therefore, we examined the total number of accesses to the L1 caches. When compared to the serial versions, the number of accesses to I-cache for MVP, in general, increased 3 percent to 13 percent for all benchmarks. The increase in the number of accesses was a result of the MVP versions having to execute more instructions for thread management, synchronization, and context switching. Furthermore, the dynamic execution that results from context switching in MVP hindered the locality in the I-cache and generated more misses compared to the serial execution. The D-cache also experienced 1 percent to 32 percent increase in the number of accesses. There are two main reasons for this behavior. First, the speculative execution seems to have more profound effect on the MVP versions since multithreading effectively increases the overall amount of the speculative fetches by flushing the data on context switching and then refetching the data later. Second, the software overhead of thread management and synchronization inevitably increases the number of data accesses.
An Analysis of Multithreaded Execution
In order to provide a better understanding of how cache behavior and other components of multithreading contribute to the overall performance, we developed a simple analytical model that compares the serial and the multithreaded execution models. First, consider the serial case. Suppose the processor executes instructions for $R_{ser}$ cycles on average before a cache miss occurs. Let $L$ denote the average cache miss penalty, which represents the main memory access time and the time to refill the cache line. Assuming $M_{ser}$ cache misses occur during the program execution, the total execution time for the serial model, $T_{ser}$, is given by

$$T_{ser} = (R_{ser} + L) M_{ser}. \tag{1}$$

The performance of multithreading depends on two major factors: available parallelism and processor resources. If an application does not exhibit a sufficient amount of parallelism (i.e., thread parallelism) for multithreading, the processor utilization will not increase. Even if parallelism exists, the sharing of processor resources (e.g., caches, functional units, etc.) among threads, the context switching costs, and the overhead of thread management and scheduling may limit the overall performance. For simplicity and convenience, assume that enough parallelism exists for multithreading throughout the program execution. Suppose a multithreaded processor performs a context switch at an interval of $R_{mt}$ cycles at a cost of $C$ cycles. Assuming the processor performs $M_{mt}$ context switches during its execution and an overhead of $O$ cycles is spent on thread management and scheduling, the execution time of the multithreaded model can be given as $T_{mt} = O + (R_{mt} + C) M_{mt}$ if the cache miss latency is completely masked by the execution of other threads. Otherwise, the performance will suffer from the unmasked portion of the miss latency. In this case, the processor can execute the other $H - 1$ threads to tolerate the memory latency, where $H$ is the number of hardware contexts (complete masking requires $R_{mt}(H - 1) \geq L$). The execution time is then expressed as

$$T_{mt} = O + (R_{mt} + C) M_{mt} + \bigl(L - R_{mt}(H - 1)\bigr) M_{mt}. \tag{2}$$

The term $\bigl(L - R_{mt}(H - 1)\bigr) M_{mt}$ in the above equation represents the total amount of memory latency that could not be tolerated despite executing other threads. Therefore, the effective improvement of the multithreaded execution over the serial execution, $T_{impr}$, can be estimated as

$$T_{impr} = T_{ser} - T_{mt} \approx R_{mt}(H - 1) M_{mt} + L (M_{ser} - M_{mt}) - O - M_{mt} C. \tag{3}$$
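The two execution-time expressions can be written as a small calculator. This is a sketch under one reading of the definitions above: serial time is $(R_{ser} + L) M_{ser}$, and multithreaded time is $O + (R_{mt} + C) M_{mt}$ plus, when the other $H - 1$ threads cannot cover the latency, an unmasked part $(L - R_{mt}(H - 1)) M_{mt}$. The function names and all numeric inputs are hypothetical, not measurements from the paper.

```python
def t_serial(r_ser: float, latency: float, m_ser: float) -> float:
    """Serial execution time: run R_ser cycles, then stall L cycles, M_ser times."""
    return (r_ser + latency) * m_ser


def t_multithreaded(r_mt: float, latency: float, m_mt: float,
                    switch_cost: float, overhead: float, contexts: int) -> float:
    """Multithreaded execution time.

    The unmasked part of each miss is what remains of the latency L after the
    other (H - 1) contexts have each run for R_mt cycles; it is zero when the
    latency is completely masked."""
    unmasked = max(0.0, latency - r_mt * (contexts - 1))
    return overhead + (r_mt + switch_cost + unmasked) * m_mt


# Hypothetical inputs: 1000 misses, 60-cycle miss penalty, 3-cycle switches.
serial = t_serial(r_ser=100, latency=60, m_ser=1000)
masked = t_multithreaded(r_mt=100, latency=60, m_mt=1000,
                         switch_cost=3, overhead=5000, contexts=4)
unmasked = t_multithreaded(r_mt=10, latency=60, m_mt=1000,
                           switch_cost=3, overhead=5000, contexts=2)
```

In the first multithreaded case three other contexts provide 300 cycles of work per miss, so the 60-cycle latency disappears from the total; in the second, a single other context running 10 cycles leaves 50 cycles of each miss exposed.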
An approximation for (3) is obtained by assuming that the amount of computation for a given algorithm is about the same for both serial and multithreaded executions (i.e., $R_{ser} M_{ser} \approx R_{mt} M_{mt}$). The first term, $R_{mt}(H - 1) M_{mt}$, represents how much of the memory latency can be tolerated by multithreading, or the amount of overlapping that occurs between memory latency and computation (Tolerance). This term reflects the fact that the tolerance to memory latency increases as the number of hardware contexts increases. However, if the program does not have sufficient parallelism to fully utilize the hardware contexts, the tolerance will be limited by the available parallelism in the program rather than by the hardware contexts. The term $L(M_{ser} - M_{mt})$, on the other hand, can be used as a measure of the sharing effects of multithreading (Sharing Eff). If the multithreaded execution causes fewer cache misses than the serial execution, this term becomes positive, signifying that data locality exists and the threads share a certain portion of the data (i.e., working set) during the execution. Otherwise, the conflict among threads will increase the number of cache misses, and the term becomes negative, indicating less data locality. Also, the last two terms, $O$ and $M_{mt} C$, represent the overhead of multithreading, where $O$ reflects the cycles spent on thread management and scheduling (Overhead) and $M_{mt} C$ is the total number of cycles required for hardware context switches (Switching Cost). The effects of the four components in (3) were studied for MVP with four hardware contexts. Fig. 5 shows the percentage of each component relative to the total execution time for each benchmark. Each component was computed using the parameter values obtained from the simulation results. The negative values indicate the number of additional cycles incurred in MVP over serial execution, and the positive values signify the amount of benefit obtained from multithreaded execution.
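The four components can be computed directly from the terms named above. The sketch below uses hypothetical parameter values (not the simulation results behind Fig. 5) chosen so that $R_{ser} M_{ser} = R_{mt} M_{mt}$ holds exactly and the latency is not fully masked; under those conditions the components should sum to the difference between the serial time $(R_{ser} + L) M_{ser}$ and the multithreaded time with the unmasked-latency term.

```python
from typing import Dict


def improvement_components(r_mt: float, latency: float, m_ser: float, m_mt: float,
                           switch_cost: float, overhead: float,
                           contexts: int) -> Dict[str, float]:
    """Split the estimated improvement into the four named components.

    Positive values are benefits of multithreading; negative values are costs."""
    return {
        "tolerance": r_mt * (contexts - 1) * m_mt,   # latency overlapped with work
        "sharing": latency * (m_ser - m_mt),         # effect of fewer or more misses
        "overhead": -overhead,                       # thread management, scheduling
        "switching": -m_mt * switch_cost,            # hardware context switches
    }


# Hypothetical values with R_ser * M_ser == R_mt * M_mt (20 * 1000 == 25 * 800).
r_ser, m_ser = 20, 1000
r_mt, m_mt = 25, 800
latency, switch_cost, overhead, contexts = 100, 3, 5000, 4

parts = improvement_components(r_mt, latency, m_ser, m_mt,
                               switch_cost, overhead, contexts)
estimated = sum(parts.values())

# Direct evaluation of serial time minus multithreaded time for comparison.
t_ser = (r_ser + latency) * m_ser
t_mt = (overhead + (r_mt + switch_cost) * m_mt
        + (latency - r_mt * (contexts - 1)) * m_mt)
direct = t_ser - t_mt
```

Here the tolerance term contributes 60,000 cycles and the sharing term 20,000 cycles, against 5,000 cycles of overhead and 2,400 cycles of switching cost, so the estimate and the direct difference agree at 72,600 cycles.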
Note that the cumulative effect of the four components reflects the performance improvement in MVP and is consistent with the results shown in Fig. 3.
Fig. 5 shows that the tolerance generally increases as the data size increases in all the benchmarks except MMT. Also, the hardware context switching costs of MVP seem to have a minimal effect on the overall execution time. MMT and MP3D have positive sharing effects, indicating that the performance of multithreading can also benefit from data sharing if the algorithms can take advantage of the locality. A similar effect is observed for the 320 case in GE, where the observed speedup was greater than in the other cases because the performance benefited not only from latency tolerance but also from the positive sharing effect due to the lower L1 miss rates (compared to the serial versions). Also, note that the performance benefit for GE comes mostly from the tolerance rather than from data sharing.
MP3D exhibits a large growth in both its tolerance and its positive sharing effect as the problem size increases. Therefore, the speedup increases almost linearly with the problem size. In contrast, RS shows growth in both the tolerance and the negative sharing effect as the data size increases. The result is that the memory latency resulting from additional cache misses offsets the tolerance and, thus, reduces the amount of effective improvement from multithreading. The graph for FFT is very similar to that for RS, but the smaller negative sharing effect and the larger tolerance resulted in a better performance improvement.
For both FFT and RS, the negative sharing effect was greater than the overhead or the switch cost. Therefore, the memory latency due to cache misses had a more significant influence on the performance of MVP for these two benchmarks, whereas the overhead cost had a greater influence on GE and MP3D. GE benchmark has barriers between the division and the elimination steps, and these operations had to be repeated many times before the computation was done; thus, a large overhead was incurred for synchronization. MP3D also has a large amount of barrier synchronization and, thus, resulted in a relatively large overhead.
CONCLUSION AND FUTURE RESEARCH DIRECTIONS
This paper discussed a simulation study of the effects of MVP on cache performance. The MVP execution model showed a 10 percent to 80 percent performance improvement over the serial execution model. Our results show that the performance improvement comes from both tolerating memory latency and exploiting data locality. When data locality can be exploited, it further enhances the overall performance improvement. On the other hand, when the sharing effect is minimal, multithreading results in additional conflict misses, thereby offsetting its advantage. The results also show that the software synchronization requirements among threads can degrade the performance of some of the simulated benchmarks. Based on our study, we plan to improve MVP in a number of ways. The first is to provide hardware support for synchronization. This improvement will not only reduce the software overhead, but may also lower the number of conflict misses because fewer context switches occur. Second, the sharing effect is important and must be encouraged whenever possible. For example, Philbin et al. used thread scheduling to improve the cache locality of serial programs [21]. In this scheme, the address information associated with each thread is provided to the scheduler, and the threads are scheduled in the order that minimizes the cache misses. Therefore, we plan to further study thread scheduling as a means of improving data locality.
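To make the idea of locality-aware thread scheduling concrete, the sketch below orders threads so that those whose address hints fall in the same memory block run consecutively. It is a deliberately simplified illustration and not the algorithm of Philbin et al. [21]: the single address hint per thread, the `block_bytes` granularity, and the function name are all assumptions.

```python
from typing import List, Tuple

Thread = Tuple[int, int]  # (thread id, address hint); both are hypothetical


def schedule_by_locality(threads: List[Thread], block_bytes: int) -> List[int]:
    """Return thread ids ordered so that threads touching the same
    block-sized region of memory are scheduled next to each other."""
    ordered = sorted(threads, key=lambda t: (t[1] // block_bytes, t[0]))
    return [tid for tid, _ in ordered]


# Threads 0 and 2 work near address 0x0000; threads 1 and 3 near 0x8000.
threads = [(0, 0x0000), (1, 0x8000), (2, 0x0040), (3, 0x8040)]
order = schedule_by_locality(threads, block_bytes=0x1000)
```

Running threads 0 and 2 back to back, and then threads 1 and 3, gives the second thread of each pair a chance to find data already brought into the cache by the first.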
Another research direction is to modify MVP to support SMT with dynamic thread creation and speculative execution, called Dynamic Simultaneous Multithreading (DSMT). The idea is to detect and create threads from a serial program without compiler intervention. These dynamic threads will be executed concurrently on an SMT-like machine with special hardware support for register and memory dataflow and speculative execution. The benefits of DSMT are severalfold: 1) it eliminates the need for programmers and compilers to generate multithreaded code, 2) it overcomes the technological limitations of arbitrarily increasing the instruction window size to achieve a wide-issue bandwidth, 3) speculative execution can be aggressively applied across multiple threads, and 4) it reduces the required fetch bandwidth by taking advantage of the fact that multiple threads share common code.
Hantak Kwak received the BS degree in electronic engineering from SungKyunKwan University, Korea, in 1984, the MS degree in electrical engineering from South Dakota State University in 1987, and the PhD degree from Oregon State University in 1998. He has worked on implementing hardware/software support for multithreading in superscalar architecture, and his research interests include microarchitecture, memory systems, network computing, and software/hardware support for multithreading.
Woo-Jong Hahn received the BS, MS, and PhD degrees from Korea University in 1981, 1984, and 1995, respectively. From 1986 to 1988, he worked on a 64-bit processor and workstation server development project at AIT in Cupertino, California. Currently, he is a principal member of the engineering staff at the Electronics and Telecommunications Research Institute (ETRI) in Taejon, Korea. His main interests are computer architecture with a focus on multiprocessor and parallel processing, memory hierarchy and interconnection networks in large-scale systems, microprocessor architecture, and multimedia server architecture.
