A Preliminary Performance Study of Architectural Support for Multithreading by Ortiz-Arroyo, Daniel
   
 
Aalborg Universitet
A Preliminary Performance Study of Architectural Support for Multithreading
Ortiz-Arroyo, Daniel
Published in:
Proceedings of the Thirtieth Hawaii International Conference on System Sciences, 1997
Publication date:
1997
Document Version
Early version, also known as pre-print
Link to publication from Aalborg University
Citation for published version (APA):
Ortiz-Arroyo, D. (1997). A Preliminary Performance Study of Architectural Support for Multithreading. In
Proceedings of the Thirtieth Hawaii International Conference on System Sciences, 1997 IEEE.
A Preliminary Performance Study of Architectural 
Support for Multithreading 
Daniel Ortiz and Ben Lee
Department of Electrical and Computer Engineering
Oregon State University
Corvallis, OR 97331
{dortiz, benl}@ece.orst.edu
Suk-Han Yoon and Kee-Wook Rim
Computer Division
Electronics and Telecommunications Research Institute
Taejon, Korea
{shyoon, kwrim}@etri.re.kr
This research was supported in part by the Electronics and Telecommunications Research Institute, Taejon, Korea.
Abstract
This paper discusses a preliminary performance study of a hybrid multithreaded execution model that combines a software-controlled multithreaded system with hardware support for efficient context switching and thread scheduling. The hardware support for multithreading is augmented with a software thread scheduling technique called set scheduling, and the benefit of both to overall performance is discussed. Set scheduling schedules multiple threads onto the hardware scheduler to minimize software scheduling and context switching costs. An analytical model of the proposed multithreaded model is discussed, and simulation results of processor utilization based on the proposed model are presented. Through simulation, we find that the hybrid multithreaded execution model results in higher processor utilization than traditional software-controlled multithreading.
1. Introduction
Multithreading has been proposed as a promising technique to improve the performance of shared-memory multiprocessor systems. In a multithreaded system, high processor utilization is achieved by interleaving the execution of a number of computational threads through a processor pipeline. To achieve maximum efficiency, a context switch occurs when the execution of a thread blocks due to a long-latency operation, such as a cache miss or a thread synchronization. In the case of a cache miss, the requested data is obtained either from local memory, in the case of Uniform Memory Access (UMA) machines, or through a remote memory access, in the case of Cache Coherent Non-Uniform Memory Access (CC-NUMA) machines. The memory latency is then hidden by overlapping it with the execution of a new thread.
Traditionally, support for multithreading has been provided either in software or in hardware. Hardware support for multithreading is provided through fast context switching capabilities and multiple hardware contexts in the processor. The degree of hardware support provided can vary greatly. For example, it can be as simple as SPARC register windows for supporting multiple hardware contexts [1] or as complex as the Tera multiprocessor, where each processor supports up to 128 processor states and can context-switch on a cycle-by-cycle basis [2].
One approach to implementing multithreading in software is to use special languages and compilers that automatically generate multiple threads for execution. An example that follows this approach is TAM [6]. TAM relies on an appropriate compilation strategy and program representation rather than elaborate hardware. However, this approach requires languages with functional semantics and complex compiler analysis to generate threads.
An alternative method is to use traditional languages extended with software system support at various levels (herein referred to as software-controlled multithreading) [4, 9, 10]. For example, user-level multithreading support is provided by a collection of library function calls to create, synchronize, and schedule threads. At the system level, the kernel manages all thread activities. There is also an approach where thread management is implemented entirely as a user library [9]. One such library is the POSIX 1003.4a threads extension [8], or Pthreads for short. Pthreads provides various functions, such as thread creation and synchronization, mutual exclusion, and condition variables, to support multithreaded programming. Pthreads is widely available and runs on numerous commercial platforms including SGI-IRIX, Alpha-OSF, SPARC-SunOS, HPPA-HPUX, and R2000-Ultrix.
In light of the preceding discussion, this paper presents the hybrid multithreaded model, which is a software-controlled multithreaded system extended with hardware support for efficient context switching and thread scheduling. The idea behind the hybrid approach is to utilize all the existing features of a software-controlled multithreaded system and at the same time migrate some of the responsibilities of thread management to hardware. This is achieved using a technique called set scheduling, which acts as an interface between hardware and software and yet provides a transparent view to the programmer. The main advantage of the hybrid method is that expensive software context switching and thread scheduling costs occur only when threads are initially scheduled onto the processor; any subsequent context switching and thread scheduling are implemented in hardware. Over time, this leads to a considerable reduction in overhead cost, thereby resulting in high processor utilization.
Figure 2.1. Software-controlled multithreading.
The paper is organized as follows. In Section 2, the hybrid multithreaded execution model is described and a simple analytical model is presented. Experimental results obtained by simulation are described in Section 3. Section 4 provides a brief conclusion and possible future work.
2. The Hybrid Multithreaded Model 
In order to illustrate the advantage of having hardware support for multithreading, consider the software-controlled multithreaded execution model illustrated in Figure 2.1. Each thread issues a remote reference at an interval of R cycles, i.e., the run-length, and becomes blocked for L cycles waiting for the response to return before resuming execution. L depends on the memory access time and the delay through the interconnection network to and from memory. Between run-lengths, a context switch occurs at a cost of C cycles. For the software-controlled multithreaded model, the cost of thread scheduling is included in the context switch overhead. The processor utilization U_s based on this execution model is given by Equation (1) [12].
If the number of contexts supported is not sufficient, the processor will not be able to completely hide the memory latency L and will idle for I cycles (as in the case of Figure 2.1). On the other hand, if there is a sufficient number of contexts, the processor utilization U_s depends only on R and C.
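As a hedged reconstruction, one form of Equation (1) that is consistent with the surrounding discussion is sketched below, where N denotes the number of contexts. The saturated branch follows from the statement that, with enough contexts, utilization depends only on R and C. The linear branch and the saturation threshold N_sat are assumptions patterned on the deterministic model of [12]; the text here does not confirm them.

```latex
U_s =
\begin{cases}
\dfrac{N R}{R + L + C}, & N < N_{sat} \\[2ex]
\dfrac{R}{R + C},       & N \ge N_{sat}
\end{cases}
\qquad
N_{sat} = 1 + \frac{L}{R + C}
```

Under these assumptions the two branches agree at N = N_sat, since N_sat R / (R + L + C) = R / (R + C).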
As can be seen from Equation (1), the processor utilization is directly affected by the context switching and thread scheduling costs. For software-controlled multithreading, each thread is associated with a context that contains a thread ID, a set of registers including a PC, a thread priority, and a pointer to the stack. Whenever a context switch occurs, a new thread has to be selected (i.e., scheduled) from a pool of ready threads, and all the registers associated with the current thread must be flushed onto the stack before the registers are loaded with the top frame of the new thread. This is done automatically by the Thread Management System, and it is expensive. To reduce this cost, the objective of the proposed hybrid multithreaded model is to provide part of these features in hardware to make multithreading as efficient as possible, and yet provide a transparent view to the programmer.
Figure 2.2. Coordination between hardware and software schedulers.
Figure 2.3. Hardware support for multithreading.
Although hardware support for thread scheduling and 
context switching would benefit any processor design, the 
challenge is to incorporate these features with minimal 
modifications to the operating system and the compiler, 
and at the same time to work within the constraints estab- 
lished by the base processor architecture. 
Figure 2.2 shows the hardware and software schedulers that coordinate thread selection and execution in the proposed hybrid multithreaded model. The software side of our model consists of an existing Thread Management System augmented with a Software Scheduler that manages the Thread Pool. In most systems, the Thread Pool is implemented as a multi-level priority queue. In these systems, a thread has a priority assigned by either the Thread Management System or the user.
The responsibility of the Software Scheduler is to select a set of threads from the Thread Pool and schedule them onto the Hardware Scheduler of the processor that supports multiple hardware contexts. Threads are grouped into sets by the Software Scheduler with the objective of maximizing processor utilization. There are a number of possible policies that can be used to schedule thread sets onto the Hardware Scheduler. One straightforward approach is to schedule the next set of threads only after the previously selected threads have completed their execution. This approach is the most appropriate if thread run-lengths are about the same. However, if the thread run-lengths vary, other scheduling policies are possible. We explore these possibilities in Section 3.
Hardware support for our model consists of a conventional superscalar processor core augmented with the Hardware Scheduler and multiple hardware contexts. Once a thread has been scheduled onto the processor, it can be in one of the following three states: running, ready, or sleeping. The responsibility of the Hardware Scheduler is to keep track of the states of the threads that have been scheduled onto the processor by the Software Scheduler. This is done by using the Ready-thread Queue (RQ) and the Sleeping-thread Queue (SQ). Figure 2.3 shows the hardware support needed for our hybrid model. A long-latency operation detected by the memory management unit (MMU) causes a thread to context-switch. This is accomplished by the Hardware Scheduler, which places the blocked thread in SQ and schedules a new thread from RQ.
In addition to the hardware support shown in Figure 2.3, a processor also needs multiple hardware contexts. This can be implemented in a number of ways. One possible method is to provide separate, fixed-size contexts using a hardware-managed register file (in the form of either register windows or duplicated register sets). However, this fixed and inflexible partitioning of the register file results in a waste of scarce high-speed registers. Since the number of registers required by thread contexts varies, a more flexible approach, called Register Relocation, has been proposed [16]. This method relies on the compiler or run-time system to manage the allocation and use of contexts. Instruction operands specify context-relative register numbers, which are numbered consecutively starting with register 0. These context-relative register numbers are dynamically combined (using an OR operation) with a special register relocation mask to form absolute register numbers that are used during instruction execution. We are currently investigating which approach is more suitable for the proposed hybrid model.
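The OR-based combination described above can be illustrated with a minimal Python sketch. The function name `absolute_register`, the `context_size` parameter, and the requirement that contexts are power-of-two sized and aligned (so that OR never disturbs the mask bits) are assumptions made for this illustration; the text only states that a mask and a context-relative number are ORed together.

```python
def absolute_register(mask: int, relative: int, context_size: int) -> int:
    """Form an absolute register number by ORing a relocation mask with a
    context-relative register number.

    Assumption (not from the text): each context holds `context_size`
    registers, `context_size` is a power of two, and the mask is aligned to
    it, so the low bits of the mask are zero and OR acts like base + offset.
    """
    if context_size <= 0 or context_size & (context_size - 1):
        raise ValueError("context_size must be a positive power of two")
    if not 0 <= relative < context_size:
        raise ValueError("context-relative register number out of range")
    if mask % context_size != 0:
        raise ValueError("relocation mask must be aligned to context_size")
    return mask | relative


# Hypothetical layout: a 16-register context whose mask selects registers 32..47.
reg = absolute_register(mask=0b100000, relative=5, context_size=16)
```

With an aligned mask, the context-relative number only occupies the low-order bits, which is why a single OR is enough and no adder is needed on the register-decode path.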
In order to manage multiple contexts, each context inside the processor is represented by a tag T, containing a thread ID, a PC, and a pointer to the thread stack. When threads are scheduled by the Hardware Scheduler, the tags of the threads are downloaded onto the RQ. A thread can then be scheduled by simply dequeuing its tag from RQ, updating the stack register, and fetching the first instruction pointed to by the PC. When a thread is blocked, its tag is placed in the SQ and a context switch occurs to the next thread in RQ. Later, when the blocked thread changes its state to ready, it is enqueued onto RQ. When all the threads from RQ (i.e., within a processor) have completed their execution, the Software Scheduler schedules a new set of threads.
Figure 2.4. Queue management by the Thread Scheduler.
In order to keep track of the transition between sleeping and ready threads, each T in SQ is associated with a timer, wt. This is shown in Figure 2.4. When a context switch occurs during the execution of a thread T_i, it is sent to SQ with wt set to L, and a new thread T_j is selected for execution from RQ. T_i will remain in SQ for L cycles waiting for the memory to respond to its request. Eventually, when L cycles have elapsed, T_i will be placed into RQ by the Hardware Scheduler and its state will be changed to ready.
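The queue handling described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware design: the class and method names (`Tag`, `HardwareScheduler`, `dispatch`, `block_running`, `tick`) are hypothetical, the timer wt is modeled as a per-tag countdown, and time advances in explicit calls to `tick`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Tag:
    """Per-context tag: thread ID, PC, stack pointer, plus the SQ timer wt."""
    thread_id: int
    pc: int
    stack_ptr: int
    wt: int = 0  # remaining cycles to wait while in SQ (assumed countdown)


class HardwareScheduler:
    def __init__(self, tags):
        self.rq = deque(tags)  # Ready-thread Queue (FIFO)
        self.sq = []           # Sleeping-thread Queue
        self.running = None

    def dispatch(self):
        """Dequeue the next ready tag, or return None if RQ is empty."""
        self.running = self.rq.popleft() if self.rq else None
        return self.running

    def block_running(self, latency):
        """Context switch: move the running tag to SQ with wt = latency,
        then select the next thread from RQ."""
        if self.running is None:
            raise RuntimeError("no running thread to block")
        self.running.wt = latency
        self.sq.append(self.running)
        return self.dispatch()

    def tick(self, cycles=1):
        """Advance all SQ timers; tags whose timer expires become ready."""
        still_sleeping = []
        for tag in self.sq:
            tag.wt -= cycles
            if tag.wt <= 0:
                tag.wt = 0
                self.rq.append(tag)
            else:
                still_sleeping.append(tag)
        self.sq = still_sleeping


# Two hypothetical threads; thread 0 blocks for L = 5 cycles.
sched = HardwareScheduler([Tag(0, 0x100, 0x8000), Tag(1, 0x200, 0x9000)])
sched.dispatch()
sched.block_running(latency=5)
sched.tick(5)
```

In this sketch, tags leave SQ in the order their timers expire, so with a constant latency the SQ behaves as a FIFO queue.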
When R and L are constant, SQ will behave as a FIFO queue and thus each thread will be retired from SQ in order. However, this is not a realistic assumption because in UMA machines bus contention will cause L to vary. Moreover, in CC-NUMA machines, network contention and the routing algorithm will affect L. Variation in memory latency can be handled by mapping cache line tags to wt. The Hardware Scheduler then simply identifies each thread whose request has been served and enqueues it onto RQ.
A simple analytical model for our hybrid multithreaded system is obtained by considering the effects that set scheduling operations have on processor utilization. Figure 2.5 shows the proposed multithreaded execution model through a series of set scheduling operations. During each set scheduling operation, the Software Scheduler of the Thread Management System schedules N threads onto the RQ at a cost of S cycles, i.e., S = NC. Between set scheduling operations, there is a total of G hardware context switches, each with a cost of c cycles, among the N contexts scheduled onto the processor.
Assuming that R, L, c, and C are constant, we can express processor utilization for two separate cases. In the first case, the number of contexts supported by the processor is not enough to hide the memory latency, and therefore the processor utilization U_h increases linearly as a function of N, i.e.,

    U_h = \frac{N R}{R + L + \frac{N C}{G}}    (2)

where G represents the total number of context switches for all the threads and therefore G/N represents the average number of context switches in a thread. In the second case, the number of contexts is sufficient to hide the latency, thus performance loss comes from the context switching overhead and the set scheduling cost (as in the case of Figure 2.5), i.e.,

    U_h = \frac{R}{R + c + \frac{N C}{G}}    (3)
Equation (3) shows that the software scheduling and context switching cost C in Equation (1) has been replaced by the hardware context switch cost c plus the software context switching cost amortized over the average number of context switches in a thread, NC/G. This means that even in the saturation region G/N has some effect on processor utilization. However, if G/N is sufficiently large, the processor utilization improves by a factor of (R+C)/(R+c).
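To make the comparison concrete, Equations (2) and (3) can be evaluated numerically with a short Python sketch. The function names are hypothetical, the saturated software-controlled form R/(R+C) is taken from the remark that utilization then depends only on R and C, and combining the two hybrid cases with `min` is an assumption of this sketch rather than something stated in the text.

```python
def software_saturated(R, C):
    """Software-controlled utilization with enough contexts: R / (R + C)."""
    return R / (R + C)


def hybrid_linear(N, R, L, C, G):
    """Equation (2): too few contexts to hide the latency L."""
    return N * R / (R + L + N * C / G)


def hybrid_saturated(N, R, c, C, G):
    """Equation (3): latency hidden; cost is c plus the amortized N*C/G."""
    return R / (R + c + N * C / G)


def hybrid_utilization(N, R, L, c, C, G):
    """Assumed way to pick the applicable case: the smaller of (2) and (3)."""
    return min(hybrid_linear(N, R, L, C, G), hybrid_saturated(N, R, c, C, G))


def limiting_improvement(R, C, c):
    """Improvement factor (R + C) / (R + c) approached as G/N grows large."""
    return (R + C) / (R + c)


# Illustrative parameters: R, L, c follow Section 3; N and G are hypothetical.
R, L, c, N, G = 100, 500, 2, 8, 400
for C in (10, 50):
    print(C, hybrid_utilization(N, R, L, c, C, G), software_saturated(R, C))
```

With G/N = 50 in this example, the amortized term NC/G stays at or below one cycle, which is why the hybrid utilization changes little as C grows while the software-controlled value drops sharply.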
Figure 2.5. Hybrid multithreaded execution model.
Figure 3.1. Comparison between theoretical (solid lines) and simulation (dotted lines) results. (Horizontal axis: number of contexts.)
3. Simulation Results
In order to evaluate the performance of the hybrid multithreaded system described in the previous section, a simulation study was conducted. Figure 3.1 compares the theoretical processor utilization of the hybrid multithreaded model with the results obtained from our simulation when R and L are constant, for various values of C. Plots were obtained by running 1,000 threads with R=100 cycles, c=2 cycles, and L=500 cycles. The comparison shows that the simulation results were comparable to the theoretical results from Equations (2) and (3). More importantly, as C increases from 10 to 50, the processor utilization decreases only by approximately 2%. The primary reason for this is that the set scheduling cost is incurred only once and all subsequent context switches are done in hardware. Therefore, the hybrid method is more immune to variations in C.
To obtain a more realistic evaluation of our hybrid model, probability distributions were considered for R and L. Figures 3.2a and 3.2b show the effects for both the hybrid and software-controlled models when R was modeled by a geometric distribution and L by a negative exponential distribution. Again, our simulation results were obtained by running 1,000 threads with an overall execution time of approximately 500,000 cycles.
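A minimal sketch of how such run-lengths and latencies could be drawn is given below. It is not the simulator used for the results reported here: the function names are hypothetical, the geometric distribution is assumed to be defined on {1, 2, ...} with success probability 1/mean, and the seed is arbitrary.

```python
import math
import random


def sample_run_length(mean_r, rng):
    """Draw R from a geometric distribution on {1, 2, ...} with the given
    mean (assumed parameterization: p = 1 / mean_r), by inverse transform."""
    if mean_r <= 0:
        raise ValueError("mean_r must be positive")
    p = 1.0 / mean_r
    if p >= 1.0:
        return 1  # degenerate case: every run-length is one cycle
    u = rng.random()  # uniform in [0, 1), so 1 - u lies in (0, 1]
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1


def sample_latency(mean_l, rng):
    """Draw L from a negative exponential distribution with the given mean."""
    return rng.expovariate(1.0 / mean_l)


rng = random.Random(1997)  # arbitrary seed, for repeatability only
run_lengths = [sample_run_length(100, rng) for _ in range(1000)]
latencies = [sample_latency(500, rng) for _ in range(1000)]
```

Both distributions are memoryless, so the time remaining until the next remote reference or memory response does not depend on how long a thread has already been running or waiting.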
Figure 3.2a compares the performance when R has a mean value of 100 cycles, L has a mean value of 500 cycles, and c = 2 cycles. Results show that not only does the hybrid model outperform its software-controlled counterpart, but because it is more immune to variations in C, the performance (i.e., processor efficiency) gap widens as C increases. Our findings also indicate that the performance of the software-controlled execution model is strongly affected by the granularity of threads. This can be seen in Figure 3.2b, where R has a mean value of 20 cycles, L has a mean value of 100 cycles, and c = 1 cycle. When C/R is large, the performance of the software-controlled model is severely affected by the software scheduling and context switching costs.
Another interesting observation is that when thread run-lengths vary, the utilization goes down (see Figures 3.1 and 3.2a-b). This is because when thread run-lengths are the same, all threads complete their execution at about the same time. Therefore, scheduling the next set of threads only after the previous set of threads has completed execution will cause minimal idling. However, when thread run-lengths vary, some threads will complete first, reducing the number of threads among which to context-switch.
Figure 3.2a. Comparison between hybrid (solid lines) and software-controlled (dotted lines) execution models: R has a mean value of 100, L has a mean value of 500, and c=2. (Horizontal axis: number of contexts.)
Figure 3.2b. Comparison between hybrid (solid lines) and software-controlled (dotted lines) execution models: R has a mean value of 20, L has a mean value of 100, and c=1. (Horizontal axis: number of contexts.)
Figure 3.3a. Effects of scheduling policies when C=10, R=20, L=100, and c=1. (Horizontal axis: number of contexts.)
Figure 3.3b. Effects of scheduling policies when C=20, R=20, L=100, and c=1. (Horizontal axis: number of contexts.)
To overcome this deficiency, different scheduling policies were tried: (1) schedule a new thread immediately after a thread completes its execution, and (2) schedule N/2 new threads when N/2 threads complete their execution. Results of these two scheduling policies were then compared against scheduling N new threads when N threads finish their execution. These are shown in Figures 3.3a-3.3c for various values of C. In these graphs, R was modeled by a geometric distribution with a mean value of 20 cycles, L by a negative exponential distribution with a mean value of 100 cycles, and c=1 cycle. These results show that for all three values of C it is always better to schedule a new thread immediately after a thread completes its execution. Thus, scheduling one thread at a time will minimize idling due to a lack of threads among which to context-switch.
4. Conclusion and Future Work 
Our preliminary performance study indicates that the proposed hybrid multithreaded model results in improved processor utilization over software-controlled multithreading. Higher processor utilization is achieved by having the Software Scheduler set-schedule threads onto the Hardware Scheduler. The effects of various set scheduling techniques on the overall performance of the hybrid multithreaded system were studied. The set scheduling technique acts as an interface between an existing software-controlled multithreaded system and the hardware support for multithreading. We found that the set scheduling technique together with hardware support for multithreading has a considerable performance advantage over traditional software-controlled multithreaded systems.
Although our performance results are encouraging, they are based on a simple execution model and are therefore quite preliminary. The future plan is to develop a detailed simulator for the hybrid multithreaded model. We are currently working on such a simulator, which integrates the user-library Pthreads package developed by Chris Provenzano at MIT with a MIPS-based generic superscalar simulator, called SimpleScalar, developed at the University of Wisconsin [5].
Using the hybrid multithreaded processor simulator, we plan to pursue a number of design issues. First, it is not clear at present what kind of hardware context representation is the most appropriate for our hybrid multithreaded processor. Multiple hardware contexts can be implemented either by duplicating register sets or by using register relocation. Register relocation is more flexible but requires modification to the compiler. As a first-cut design, the plan is to use multiple register sets and to use thread tags to map onto register sets.
For more information on the Pthreads package, see http://www.mit.edu:8001/people/proven/pthreads.html.
Another issue is the design of the instruction window. Currently, SimpleScalar implements a centralized instruction window, where data hazards are resolved and ready instructions are issued to functional units. Once an instruction from a thread is issued to a functional unit, any subsequent blocking of that thread will not affect the execution of that instruction. However, this is not the case for instructions from the blocked thread that are waiting in the instruction window to be issued. These unissued instructions will continue to occupy valuable resources and impede the execution of other ready threads. There are two ways to resolve this problem. One method is to implement multiple instruction windows and multiplex the thread issuing among them. The other method is to simply buffer the blocked thread. Thus, there will be one instruction window and N-1 thread buffers. The latter method would be much cheaper but will result in a higher hardware context switching cost, since threads have to be moved back and forth between the instruction window and the thread buffers.
Another possibility we plan to explore is simultaneous multithreading (SMT) [15]. Simultaneous multithreading is a technique where multiple independent threads issue instructions to a wide-issue superscalar processor's functional units in a single cycle. The advantage of SMT is that both instruction-level parallelism and thread-level parallelism can be exploited to achieve high performance. Implementing SMT will require relatively minor changes to the hybrid multithreaded processor: the instruction fetch mechanism can be implemented as multiple instruction windows. However, since the SMT proposed in [15] uses independent threads from different programs and SMT in the hybrid multithreaded processor occurs among threads from the same program, we plan to study what effect interaction among the threads within a program will have on the design of the microarchitecture.
5. Bibliography
[1] Agarwal, A. et al., "April: A Processor Architecture for Multiprocessing," Proc. 17th Annual Int'l. Symposium on Computer Architecture, May 1990, pp. 104-114.
[2] Alverson, R. et al., "The Tera Computer System," International Conference on Supercomputing, Sept. 1990, pp. 1-6.
[3] Ang, B. S. et al., "Star-T the Next Generation: Integrating Global Caches and Dataflow Architectures," Technical Report CSG Memo 354, LCS MIT, Feb. 25, 1994.
[4] Blumofe, R. D. et al., "Cilk: An Efficient Multithreaded Runtime System," Proc. of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July 1995.
[5] Burger, D. et al., "Evaluating Future Microprocessors: The SimpleScalar Tool Set," UW Computer Sciences Technical Report #1308, July 1996.
[6] Culler, D. E. et al., "TAM - A Compiler Controlled Threaded Abstract Machine," Journal of Parallel and Distributed Computing, Vol. 18, No. 3, July 1993, pp. 347-370.
[7] Chiou, D. et al., "*T-NG: Delivering Seamless Parallel Computing," Proceedings of Euro-Par 95, 1995.
[8] IEEE, Threads Extension for Portable Operating Systems (Draft 6), Feb. 1992. P1003.4a/D6.
[9] Mueller, F., "A Library Implementation of POSIX Threads under UNIX," Proc. 1993 USENIX Winter Conference, San Diego, CA, pp. 29-41.
[10] Nikhil, R. S., "Cid: A Parallel, 'Shared-memory' C for Distributed-memory Machines," Proc. 7th Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY, August 1994, Springer Verlag LNCS.
[11] Papadopoulos, G. M. et al., "*T: Integrated Building Blocks for Parallel Computing," Supercomputing '93, Portland, Oregon, Nov. 19, 1993.
[12] Saavedra, R. H. et al., "Analysis of Multithreaded Architectures for Parallel Computing," 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 169-178.
[13] Schauser, K. E. et al., "Compiler-Controlled Multithreading for Lenient Parallel Languages," 5th ACM Conference on Functional Programming Languages and Computer Architecture, Aug. 1991, pp. 50-72.
[14] Thekkath, R. and Eggers, S. J., "The Effectiveness of Multiple Hardware Contexts," Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1994, pp. 328-337.
[15] Tullsen, D. M. et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Annual Int'l. Symposium on Computer Architecture, Jun. 1995.
[16] Waldspurger, C. A. and Weihl, W. E., "Register Relocation: Flexible Contexts for Multithreading," Proc. 20th Annual Int'l. Symposium on Computer Architecture, 1989, pp. 273-280.
