Architectural support for multithreading on a 4-way multiprocessor by Lee, Ben et al.
AN ABSTRACT OF THE THESIS OF 

Gwang-Myung Kim for the degree of Master of Science in Electrical and Computer
Engineering presented on December 10, 1999. Title: Architectural Support for
Multithreading on a 4-way Multiprocessor.
Abstract approved:
Ben Lee 
Microprocessors will have more than a billion logic transistors on a single chip
in the near future. Several alternatives have been suggested for obtaining the highest
performance with billion-transistor chips. Among them, the on-chip multiprocessor is a
promising alternative to the current superscalar microprocessor. It can execute multiple
threads effectively on multiple processors in parallel if the application program is
properly parallelized. This increases processor utilization and helps tolerate the
latency caused by data dependencies and cache misses.
The Electronics and Telecommunications Research Institute (ETRI) in South
Korea developed an on-chip multiprocessor RAPTOR simulator, "RapSim", which
contains four SPARC microprocessor cores. To support this 4-way multiprocessor
simulator, the Multithreaded Mini Operating System (MMOS) was developed by the OSU
MMOS group. RapSim runs multiple threads on multiple processor cores concurrently.
POSIX threads were used to build a Symmetric Multiprocessor (SMP) safe Pthreads
package, called MMOS. Benchmarks must be properly parallelized by the programmer
to run multiple threads across the multiple processors simultaneously. Performance
simulation results show that RAPTOR can exploit thread-level parallelism effectively and
offers a promising architecture for future on-chip multiprocessor designs.

Architectural Support for Multithreading on a 4-way Multiprocessor

by 

Gwang-Myung Kim 

A THESIS 

submitted to 

Oregon State University 

In partial fulfillment of 

The requirements for the 

degree of 

Master of Science 

Presented December 10, 1999 

Commencement  June 2000 
Master of Science thesis of Gwang-Myung Kim presented on December 10, 1999
APPROVED:
Major Professor, representing Electrical and Computer Engineering
Head of Department of Electrical and Computer Engineering
I understand that my thesis will become part of the permanent collection of Oregon State
University libraries. My signature below authorizes release of my thesis to any reader
upon request.
Gwang-Myung Kim, Author

ACKNOWLEDGMENTS
First I thank God who guides me on the right way throughout my life. 
I would like to thank my advisor, Dr. Ben Lee, for providing me with much of my 
professional  knowledge  and  research  skills.  Without  his  generosity  of guidance  and 
patience, this thesis would not have been possible.  I would like to thank Dr. Alexandre F. 
Tenca,  Dr.  Kyoung-Rok  Cho,  and  Dr.  David McIntyre,  the  members  of my graduate 
committee. I also thank Dr. Hantak Kwak for his contribution towards my research and
education.  Mark N. Dailey and Kyoung Park helped me out in many ways by answering 
all of my questions about the research. 
I cannot forget to thank my mother and family for their patience, support and love. 
Special thanks go to my wife, Ye-Sun Lee, who has always supported me in whatever decision
I made. Last, I would like to thank the other graduate students and church people for
their material and spiritual support. Without their support, it would have been a much
tougher course.

TABLE OF CONTENTS

1. Introduction

2. Pthreads
   2.1 Introduction
   2.2 Architectural Requirement of Pthreads
   2.3 States of Pthreads
       2.3.1 Pthreads Creation and Start
       2.3.2 Pthreads in Running and Blocked State
       2.3.3 Pthreads Termination
       2.3.4 SMP Environmental Support

3. RapSim (RAPTOR Simulator)
   3.1 RAPTOR Architecture
       3.1.1 GPU
       3.1.2 GCU
       3.1.3 MCU
       3.1.4 IBU, ECU, and PIU
   3.2 RapSim Features

4. MMOS (Multithreaded Mini Operating System)
   4.1 Porting Pthreads to Sun UltraSPARC
       4.1.1 Machine Dependent Header Files
       4.1.2 Machine Dependent Codes
   4.2 Integration of MMOS and RapSim
       4.2.1 SMP Initialization
       4.2.2 machdep_cpuid
       4.2.3 Copy Registers
       4.2.4 Software Trap Support for longjmp
   4.3 Performance Simulation Result
       4.3.1 Benchmarks
       4.3.2 Instruction Distribution
       4.3.3 Execution Cycle vs IPC
       4.3.4 Speed-up vs Thread Overhead

5. Conclusions and Future Projections

Bibliography
LIST OF FIGURES 

Figure
2.1: Thread State Transitions [4]
2.2: Handling of Priority Queue Enqueue
2.3: Handling of Priority Queue Dequeue
3.1: Block Diagram of Raptor Microprocessor [6]
3.2: Block Diagram of RapSim [6]
4.1: Organization of MMOS and RapSim Integration
4.2: Example of FP State Saving [10]
4.3: The f Registers [11]
4.4: FP Registers Saving
4.5: Mutex Operation [4]
4.6: pthread_mutex_last Operation
4.7: Assignment of Initialization Pointers
4.8: RapSim Execution Flow
4.9: Operations to Get the Thread ID
4.10: Getting machdep_cpuid
4.11: Operation of Software Trap
4.12: Instruction Distribution
4.13: Cycle Time vs IPC
4.14: Speed-up vs Thread Overhead
LIST OF TABLES 
Table
2.1: Execution Contexts, Schedulers, and Synchronization [4]
3.1: Architectural Configuration [6]

Architectural Support for Multithreading on a 4-way Multiprocessor
1.  Introduction 
Integrated circuit processing technology offers increasing integration density,
which  fuels  the  growth  in  microprocessor  performance.  Within  10  years  it  will  be 
possible  to  integrate  a  billion  transistors  on  a reasonably  sized  silicon  chip.  At  this 
integration level, it is  necessary to  find  parallelism to effectively utilize the transistors. 
Currently,  processor designs  dynamically  extract parallelism with  these  transistors  by 
executing  many  instructions  within  a  single,  sequential  program in  parallel.  To  find 
independent  instructions  within  a  sequential  sequence  of instructions,  or  thread  of 
control, today's processors increasingly make use of sophisticated architectural features. 
Examples are out-of-order instruction execution and speculative execution of instructions 
after branches are predicted with dynamic hardware branch prediction techniques [1]. 
The Electronics and Telecommunications Research Institute (ETRI) is
developing a new on-chip multiprocessor, called RAPTOR, which has four 2-way
superscalar processor cores and one graphic co-processor. To understand the
performance advantage of a multiprocessor core, they decided to first develop an
architecture simulator model, called RapSim (RAPTOR Simulator). However, they did
not yet have a proper operating system to support a multiprocessor environment. After
careful consideration, Dr. Ben Lee suggested a simpler and more efficient software
approach to the OS support problem. As a result, the Multithreaded Mini Operating
System (MMOS) was developed as part of this thesis to provide ETRI with an
alternative OS to test RapSim.
Future performance improvements will require processors to be enlarged to
execute more instructions per clock cycle. However, reliance on a single thread of
control limits the parallelism available for many applications, and the cost of extracting
parallelism from a single thread is becoming prohibitive [1]. Simultaneous Multithreading
(SMT) [2] can utilize the functional units of a processor by exploiting both
Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). MMOS provides
RapSim with the function of assigning and scheduling multiple threads.
MMOS  is  a  modified  version  of the  POSIX  threads  or  Pthreads.  Chapter  2 
introduces  some  core  functions  of Pthreads  and  modifications  necessary  to  support 
Symmetric Multiprocessor (SMP) safe features. The original Pthreads package contains
some synchronization schemes whose variables are shared among multiple threads. To
protect the use of these synchronization variables among multiple threads that are
running on multiple processors at the same time, another synchronization scheme must be
introduced. Additionally, a new SMP initialization function is required to initialize
multiple processors.
The RAPTOR and RapSim architectures are introduced in Chapter 3. RAPTOR consists
of several core units, and the General Processor Unit (GPU) is one of them. RapSim was
designed to simulate the functionality of the GPU. RapSim can be divided into a
pre-processing unit and a post-processing unit. The pre-processing unit generates the
instruction trace and the post-processing unit carries out the timing simulation.
Chapter 4 presents the work devoted to porting and integration of MMOS and the
RapSim processor model. Porting of the Linux version of the Pthreads package to
SPARC is explained first. The integration of MMOS and RapSim was the next task needed to
properly simulate benchmarks on RapSim. Simulation results from RapSim are also
presented in this chapter.
Finally, conclusions and future projections are given in Chapter 5.
2.  Pthreads 
The multithreading model used in our project is commonly called "Pthreads", or
"POSIX threads". Pthreads is based on UNIX System V and Berkeley UNIX, but it is
not itself an operating system. Instead, Pthreads defines the interface between
applications and their libraries. Applications can be ported from one system to another
because they see only the Pthreads interface and are independent of the system beneath
that interface [3].
This chapter offers a basic introduction to Pthreads and its architecture. We
will also discuss some core functions in the Pthreads package along with their thread
state transitions.
2.1 Introduction
"POSIX threads" refers to the thread "application programming interface" (API)
specified by the international formal standard POSIX 1003.1c-1995. This standard was
approved by the IEEE in June 1995 [4].
Threads are often called "lightweight processes", which means that threads are
relatives of UNIX processes even though they are not processes themselves. To understand
the distinction, we must examine the relation between UNIX processes and threads. In
UNIX, a process contains both an executing program and a bunch of resources, such as
the file descriptor table and address space. In a threaded model, these two roles are
split between a task and its threads. A task may have any number of threads
associated with it, and every thread must be associated with some task. A task contains
only a bunch of resources, and threads handle all execution activities. Thus, a thread
essentially consists of a program counter, a stack, and a set of registers; all the other
data structures belong to the task. A UNIX process is modeled as a task with a single
thread.
Since threads are very small compared with processes, thread creation is relatively
cheap in terms of CPU cycles. Threads share resources, but processes require their own
resources. Threads give programmers the ability to write concurrent applications that
run on both single-processor and multiprocessor machines transparently, taking
advantage of the additional processors when they exist. Additionally, threads can
increase performance in a single-processor environment when the application performs
operations that are likely to block or cause delays.
Most programs contain some kind of concurrency, even if it is only reading a
command from the input device while processing the previous command. Threaded
applications are often faster than sequential programs that do the same job. They can
also be easier to develop and maintain than non-threaded applications that do the same
job. However, threaded programming requires programmers to understand the techniques
for developing threaded applications.
2.2  Architectural Requirement of Pthreads 
The three key features of a thread system are Execution Context, Scheduling,
and Synchronization. When we want to evaluate or compare any two thread systems, we
should classify their features according to how they support execution contexts,
scheduling, and synchronization. These are the three essential facilities of any
concurrent system [4]:
1.  Execution Context is the state of a concurrent entity. A concurrent system must
provide a way to create and delete execution contexts, and maintain their state
independently. It must be able to save the state of one context and dispatch to
another at various times, for example, when one needs to wait for an external
event. It must be able to continue a context from the point where it last
executed, with the same register contents, at a later time.
2.  Scheduling determines which context (or set of contexts) should execute at any
given point in time, and switches between contexts when necessary.
3.  Synchronization provides mechanisms for concurrent execution contexts to
coordinate their use of shared resources. Synchronization is the mechanism by
which threads cooperate to accomplish a task.
                        Execution context   Scheduling                 Synchronization
Real traffic            automobile          traffic lights and signs   turn signals and brake lights
UNIX (before threads)   process             priority                   wait and pipes
Pthreads                thread              priority                   condition variables
Table 2.1: Execution Contexts, Schedulers, and Synchronization [4]
There are many ways to provide each of these facilities. Table 2.1 shows a few
examples of the three facilities in various systems.
The Pthreads scheduling scheme allows each thread to run until it reaches a context
switch point, at which the processor becomes available to another thread. It provides
several scheduling policies that allow the application to control how each thread is
scheduled according to that thread's function. The available policies include
First-In-First-Out (FIFO), Round-Robin (RR), and other scheduling policies. FIFO refers
to a simple queue scheduling mechanism. Each thread is assigned a priority level, and a
priority queue is associated with each priority level. When threads at a given priority
level become available (or runnable), they are pushed onto the end of the priority
queue. Threads move in turn to the head of the priority queue, where they can be
scheduled onto the next available processor. The RR algorithm is the same as the FIFO
algorithm except that each thread is given a time-slice in which to execute its
procedure. Another scheduling algorithm is the default one currently supported in
Solaris threads. If both a low priority thread and a high priority thread are waiting to
lock the same resource, the high priority thread will always be unblocked first when the
resource becomes available.
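As a small illustration of how an application selects one of these policies, the following sketch prepares a thread attribute object that requests round-robin scheduling through the standard POSIX attribute calls. The helper name make_rr_attr is our own and is not part of the Pthreads package discussed here. On many systems, actually creating a thread with a real-time policy such as SCHED_RR or SCHED_FIFO requires special privileges, so the sketch only builds and inspects the attribute object.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: prepare a thread-attribute object that requests
 * round-robin scheduling. Returns 0 on success or a POSIX error number;
 * *policy_out reports the policy stored in the attribute object. */
int make_rr_attr(pthread_attr_t *attr, int *policy_out)
{
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

    /* Without this, a new thread would inherit its creator's policy. */
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(attr, SCHED_RR);  /* or SCHED_FIFO */
    if (rc == 0)
        rc = pthread_attr_getschedpolicy(attr, policy_out);
    return rc;
}
```

The attribute object would then be passed as the second argument of pthread_create and released with pthread_attr_destroy once it is no longer needed.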
Synchronization may be provided using a wide variety of mechanisms. Some of
the most common forms are mutexes, condition variables, semaphores, and events.
Message passing mechanisms, such as UNIX pipes, sockets, POSIX message queues, or
other protocols for communicating between asynchronous processes, can also be used
on the same system or across a network [4].
2.3  States of Pthreads 
At any given time, a thread is in one of the four primary states shown in
Figure 2.1. A thread starts in the ready state. When the new thread begins
execution, it calls the specified pthread_start function. It may be
preempted by other threads, or block itself to wait for an external event, any number of
times. Finally, it completes its execution and either returns from the pthread_start
function or calls the pthread_exit function. If the thread has been detached, it is
immediately recycled. Otherwise, the thread remains in the terminated state until another
thread joins with it or it is detached.
[Figure 2.1 diagram: thread states and transitions; recoverable transition labels are
"created", "wait satisfied", and "done, or cancelled".]
Figure 2.1: Thread State Transitions [4]
To ensure that resources used by terminated threads become available again to the
process, each thread should always be detached when it finishes its execution. Threads
that have terminated but are not detached may retain virtual memory, including their
stacks, as well as other system resources. Detaching a thread tells the system that this
thread is no longer needed, and allows the system to reclaim the resources it has allocated
to the thread [4]. Figure 2.1 shows these thread state transitions and the
events that cause threads to move from one state to another.
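The effect of detaching can be illustrated with standard POSIX calls. The sketch below, which is our own example and not code from the Pthreads package described here, creates threads in the detached state so that their resources are reclaimed as soon as they finish. The names worker and run_detached_workers are hypothetical, and the polling loop is only a simple way for the demonstration to wait for the workers.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int finished;   /* number of workers that have completed */

static void *worker(void *arg)
{
    (void)arg;
    atomic_fetch_add(&finished, 1);
    return NULL;              /* a detached thread is recycled on return */
}

/* Start n detached workers and wait until all of them have run.
 * Returns the number of workers that finished, or -1 on error. */
int run_detached_workers(int n)
{
    pthread_attr_t attr;

    atomic_store(&finished, 0);
    if (pthread_attr_init(&attr) != 0)
        return -1;
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    for (int i = 0; i < n; i++) {
        pthread_t tid;
        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            pthread_attr_destroy(&attr);
            return -1;
        }
        /* No pthread_join here: a detached thread cannot be joined. */
    }
    pthread_attr_destroy(&attr);

    while (atomic_load(&finished) < n)
        sched_yield();        /* simple polling, adequate for a demo */
    return atomic_load(&finished);
}
```

An already running thread can be detached in the same spirit with pthread_detach.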
2.3.1  Pthreads Creation and Start 
The initial thread is created at the same time the process is created. In a system
that completely supports threaded programming, there is probably no way to execute
any code without a thread. A thread is the software context that contains the hardware
elements required to execute a program: registers, program counter, stack pointer, and so
forth. Other threads can be created by Pthreads-specific calls. The primary way to
create threads in the Pthreads package is to call the pthread_create function explicitly
in the application program. This function creates a thread that runs the
pthread_start function. Fundamentally, it is an asynchronous call to the function
pthread_start with the argument value arg. The attr argument specifies optional
creation attributes, and the identification of the new thread is returned in thread.
int pthread_create (
    pthread_t *thread,
    pthread_attr_t *attr,
    void * (*start_routine) (void *),
    void * arg);
Unlike processes created by the UNIX fork function, which begin execution at the
same point as their parents, threads begin their execution at the function specified in
pthread_create. The reason for this is clear: if threads did not start execution
elsewhere, we would have multiple threads executing the same instructions with the same
resources [4].
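A minimal sketch of this call sequence, using only the standard POSIX interface, is shown below. The structure job and the functions square_start and square_in_thread are hypothetical names chosen for the illustration. The start routine receives the arg pointer given to pthread_create, and its return value is collected by pthread_join.

```c
#include <pthread.h>

/* Hypothetical work item passed to the new thread through arg. */
struct job {
    long input;
    long result;
};

/* Start routine: the new thread begins its execution here. */
static void *square_start(void *arg)
{
    struct job *j = arg;
    j->result = j->input * j->input;
    return j;                       /* handed back through pthread_join */
}

/* Compute x * x in a separate thread; returns -1 if thread handling fails. */
long square_in_thread(long x)
{
    struct job j = { x, 0 };
    pthread_t tid;
    void *ret = NULL;

    if (pthread_create(&tid, NULL, square_start, &j) != 0)
        return -1;
    if (pthread_join(tid, &ret) != 0)   /* wait for the thread to terminate */
        return -1;
    return ((struct job *)ret)->result;
}
```

Passing NULL as the attribute argument requests the default creation attributes.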
When a new thread is created, it has the runnable state PS_RUNNING. The
thread is then enqueued into the structure pthread_prio_queue to be executed later by a
processor. However, depending on scheduling constraints, it may remain in the priority
queue for a substantial period of time before being assigned to a processor. A regular
thread has the default priority when it is inserted into the priority queue. The minimum
priority is given to the idle thread, which makes it possible for regular threads to
preempt idle threads in the priority queue.
#define PTHREAD_DEFAULT_PRIORITY 8
#define PTHREAD_MAX_PRIORITY 15
#define PTHREAD_MIN_PRIORITY 0

struct pthread_prio_level {
    struct pthread *first;
    struct pthread *last;
};

struct pthread_prio_queue {
    struct pthread *next;
    struct pthread_prio_level level[PTHREAD_MAX_PRIORITY+1];
};
[Figure 2.2 diagram: the priority queue (a "next" pointer plus levels 0 to 15, each with
"first" and "last" pointers) in its initial state holding Thread 1 at level P1, and after
Thread 2 is enqueued at level P2 for the cases P1 < P2 and P1 > P2.]
Figure 2.2: Handling of Priority Queue Enqueue
Figure 2.2 shows how the priority queue is handled when a thread is enqueued with its
priority level. If the incoming thread's priority level "P2" is greater than the
priority level "P1" of the "next" thread in the priority queue, the "next" thread is
replaced by the new thread. If the incoming thread's priority "P2" is less
than or equal to the "next" thread's priority "P1", the "next" thread is not changed. In
that case the incoming thread, "Thread 2", is linked after "Thread 1" as its next thread.
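The enqueue behavior described above can be sketched as follows. This is a simplified model written for illustration, not the code of the Pthreads package: the types thread, prio_level, and prio_queue mirror the layout of pthread_prio_level and pthread_prio_queue shown earlier, while the priority and next fields of the thread structure and the function prio_enqueue are our assumptions. All ready threads form one singly linked list ordered from highest to lowest priority, and the per-level first and last pointers locate the insertion point.

```c
#include <stddef.h>

#define MAX_PRIORITY 15

/* Hypothetical, reduced thread descriptor. */
struct thread {
    int priority;            /* 0 .. MAX_PRIORITY */
    struct thread *next;     /* next thread in the global ready list */
};

struct prio_level {
    struct thread *first;
    struct thread *last;
};

struct prio_queue {
    struct thread *next;     /* highest-priority ready thread */
    struct prio_level level[MAX_PRIORITY + 1];
};

void prio_enqueue(struct prio_queue *q, struct thread *t)
{
    struct prio_level *lv = &q->level[t->priority];

    if (lv->first != NULL) {
        /* Level already populated: append behind its last thread (FIFO). */
        t->next = lv->last->next;
        lv->last->next = t;
        lv->last = t;
        return;
    }

    lv->first = lv->last = t;

    /* Link behind the last thread of the nearest higher non-empty level. */
    for (int p = t->priority + 1; p <= MAX_PRIORITY; p++) {
        if (q->level[p].last != NULL) {
            t->next = q->level[p].last->next;
            q->level[p].last->next = t;
            return;
        }
    }

    /* No higher-priority thread is queued: t becomes the new "next". */
    t->next = q->next;
    q->next = t;
}
```

With Thread 1 queued at level P1, enqueueing Thread 2 at a higher level P2 makes Thread 2 the new "next" thread, while a level P2 that is lower than or equal to P1 leaves "next" unchanged, as in Figure 2.2.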
After a thread has been created, it will begin to execute the application program. The
initial flow of the program leads to the execution of pthread_create. The
pthread_start function is called with the argument value that was specified when
the thread was created. When the pthread_start function returns, the thread
finishes its execution but other threads can continue to run. When the function main
returns in the initial thread, the process is terminated immediately. However, calling
pthread_exit instead of returning from main allows other threads in the process
to continue running even though the initial thread has terminated.
2.3.2  Pthreads in Running and Blocked State 
A thread is in the ready state when it is first created, and whenever it becomes
available to run after being blocked. Ready threads wait in the priority queue to be
picked up by a processor. When a running thread is preempted according to the round robin
scheduling policy, the thread immediately becomes ready. A thread becomes running
when a processor selects it from the priority queue for execution. Usually this
means that another thread has blocked, or has been preempted by a timer interrupt or by an
intended switch point in the code, such as a mutex or condition variable operation. The
blocked (or preempted) thread saves its context and restores the context of the next
available thread to replace itself.
When processor resources become available after a thread finishes its execution or
blocks, another thread is selected from the priority queue and its state is changed to
running.
[Figure 2.3 diagram: the priority queue in its initial state, with Thread 1, Thread 2, and
Thread 3 linked at level P, and after dequeueing Thread 1, when "next" and the "first"
pointer of level P both refer to Thread 2.]
Figure 2.3: Handling of Priority Queue Dequeue
Figure 2.3 shows the manipulation of the priority queue when a thread is chosen
for running.
"Barriers" are an interface that can be used in the application program to
synchronize multiple threads. The barrier_wait function is a gathering point
for the related threads, where each thread waits until all the other threads have
reached this point. When the last thread arrives at the barrier, all the waiting threads are
released. A condition variable and a mutex are used to support the barrier function. The
following lists some related functions [4,5]:
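The dequeue step of Figure 2.3 can be sketched in the same simplified model. As before, the types and the function prio_dequeue are illustrative assumptions that mirror the pthread_prio_queue layout, not the package's own code. The thread referenced by "next" is always the first thread of its priority level, so removing it only requires advancing "next" and updating that level's pointers.

```c
#include <stddef.h>

#define MAX_PRIORITY 15

/* Hypothetical, reduced thread descriptor. */
struct thread {
    int priority;            /* 0 .. MAX_PRIORITY */
    struct thread *next;     /* next thread in the global ready list */
};

struct prio_level {
    struct thread *first;
    struct thread *last;
};

struct prio_queue {
    struct thread *next;     /* highest-priority ready thread */
    struct prio_level level[MAX_PRIORITY + 1];
};

/* Remove and return the highest-priority ready thread, or NULL if none. */
struct thread *prio_dequeue(struct prio_queue *q)
{
    struct thread *t = q->next;
    if (t == NULL)
        return NULL;

    struct prio_level *lv = &q->level[t->priority];
    if (lv->first == lv->last) {
        lv->first = lv->last = NULL;   /* the level becomes empty */
    } else {
        lv->first = t->next;           /* successor heads the level */
    }

    q->next = t->next;                 /* advance the global "next" pointer */
    t->next = NULL;
    return t;
}
```

Starting from the initial state of Figure 2.3, one call returns Thread 1 and leaves both "next" and the "first" pointer of level P referring to Thread 2.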
•  Mutex Lock:
int pthread_mutex_lock ( pthread_mutex_t * );
This function acquires a mutex lock. If another thread currently has the lock,
the caller is blocked and the thread is put on the mutex wait queue until the mutex
holder releases the lock. In either case, when this function returns the caller can
safely modify whatever is protected by the mutex.
•  Mutex Unlock:
int pthread_mutex_unlock ( pthread_mutex_t * );
This function is used to release a mutex lock when a thread has completed the
critical section. If any threads are waiting for the mutex, one of the waiters is
chosen and released.
•  Mutex Trylock:
This function is used for acquiring the mutex, but instead of blocking on a
failure, it returns an indication that the mutex is busy. This allows a caller to
continue doing work while the lock is busy.
•  Conditional Variable Wait:
int pthread_cond_wait (
    pthread_cond_t * ,
    pthread_mutex_t * );
Wait on a condition variable until awakened by a broadcast or signal.
•  Conditional Variable Broadcast:
Broadcast a condition variable by waking all current waiters on it.
•  Conditional Variable Signal:
Signal a condition variable by waking one waiting thread. If no scheduling policy is
in effect, an unspecified waiter is awakened.
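The barrier described before this list can be built from the mutex and condition variable calls above. The following is a minimal sketch under our own assumptions, not the barrier code used in MMOS: the type barrier_t, its fields, barrier_init, and the demonstration routines are hypothetical, and error handling for thread creation is omitted for brevity. A cycle counter lets the barrier be reused and protects waiters against spurious wakeups.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cv;
    int threshold;          /* threads required to open the barrier */
    int waiting;            /* threads still expected in this cycle */
    unsigned long cycle;    /* incremented each time the barrier opens */
} barrier_t;

int barrier_init(barrier_t *b, int count)
{
    b->threshold = b->waiting = count;
    b->cycle = 0;
    int rc = pthread_mutex_init(&b->mutex, NULL);
    if (rc != 0)
        return rc;
    return pthread_cond_init(&b->cv, NULL);
}

/* Returns 1 in the last thread to arrive and 0 in all the others. */
int barrier_wait(barrier_t *b)
{
    int last = 0;
    pthread_mutex_lock(&b->mutex);
    if (--b->waiting == 0) {
        b->cycle++;                        /* open the barrier */
        b->waiting = b->threshold;         /* re-arm it for the next use */
        pthread_cond_broadcast(&b->cv);    /* release every waiter */
        last = 1;
    } else {
        unsigned long my_cycle = b->cycle;
        while (my_cycle == b->cycle)       /* guards against spurious wakeups */
            pthread_cond_wait(&b->cv, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
    return last;
}

/* Demonstration state: every thread records its arrival, waits at the
 * barrier, and then checks that all threads had arrived before release. */
struct demo {
    barrier_t barrier;
    pthread_mutex_t lock;
    int arrived, saw_all, nthreads;
};

static void *demo_thread(void *arg)
{
    struct demo *d = arg;
    pthread_mutex_lock(&d->lock);
    d->arrived++;
    pthread_mutex_unlock(&d->lock);

    barrier_wait(&d->barrier);

    pthread_mutex_lock(&d->lock);
    if (d->arrived == d->nthreads)
        d->saw_all++;
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* Returns how many threads saw all arrivals after passing the barrier. */
int barrier_demo(int nthreads)
{
    enum { MAX_THREADS = 16 };
    pthread_t tid[MAX_THREADS];
    struct demo d = { .arrived = 0, .saw_all = 0, .nthreads = nthreads };

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    if (barrier_init(&d.barrier, nthreads) != 0)
        return -1;
    pthread_mutex_init(&d.lock, NULL);
    for (int i = 0; i < nthreads; i++)     /* creation errors not handled */
        pthread_create(&tid[i], NULL, demo_thread, &d);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return d.saw_all;
}
```

Because no thread can pass barrier_wait before the last one has arrived, every thread in the demonstration observes the full arrival count.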
2.3.3  Pthreads Termination 
A thread usually finishes its execution by returning from its start function, which was
passed to the pthread_create function. Threads are terminated by calling the function
pthread_exit after they return from executing pthread_start. First, the
thread that calls pthread_exit executes all cleanup handlers to accomplish the
cleanup, somewhat like atexit for a process. If the thread was already detached, it is
freed and recycled. If any threads are waiting to join the current thread, they
are dequeued from the join queue of the current thread and put back into the
priority queue. Afterwards, the terminated thread remains available for another thread to
join with it using pthread_join. It is put into the "zombie" queue because it still
exists even though it is "dead".
The main thread reaches pthread_join before it finishes the application program. It
checks the status of the other threads at this function. The return value of pthread_join
depends on whether the target thread is in the "dead queue" or not. When pthread_join
returns the value "OK", indicating that the specified thread has already terminated, the
terminated thread in the "dead queue" is detached and freed, and will be recycled
later. However, if the check does not return "OK", the target thread is still running. In
that case, the main thread is enqueued into the "join queue" of the target thread and waits
until the target thread awakens it. The main thread is awakened from the "join queue" when
the thread that owns the "join queue" finishes its execution and calls pthread_exit.
2.3.4  SMP Environmental Support 
Unlike the original Pthreads package, the modified Pthreads package was designed
to support a Symmetric Multiprocessor (SMP) environment. For this purpose, some
modifications were added, including SMP initialization and a
synchronization scheme.
In addition to the functions provided by Pthreads, we had to add one additional call
to the Pthreads initialization routine:
void smp_init ( void );
This is used to start the SMP implementation. On our Linux test system, this
function calls the clone() system call to clone a copy of the caller. The purpose of
this function call is to start up the other processor(s) on the RapSim simulator. This
function must be the first Pthreads function that is called. The caller will return and
continue executing as the main thread of execution. All other processors will initialize
themselves and start executing very low priority "idle" threads. The purpose
of the idle thread is nothing more than to use up processor cycles when there is no
useful work for a processor to perform. Note that the number of threads in a system is
limited only by the available memory, while the number of processors is fixed via a
symbolic constant. However, for the best performance, it makes sense to allocate only
as many threads as there are processors [5].
The modified Pthreads package was required to support SMP safe behavior by
adding the test_and_set macro. To protect the internal data structures used by
Pthreads itself, we utilize the concept of an atomic lock implemented by
test_and_set [5]. The original Pthreads package cannot guarantee
SMP safe behavior. When multiple processors run multiple threads
simultaneously, one thread can intrude into another thread's critical section, so the
critical section must be protected by a synchronization scheme. Even though
the original Pthreads package contains some synchronization schemes, it was not
designed to be used in a multiprocessor environment. Therefore we need another
scheme to support this behavior without causing any critical section problems.
The additional test_and_set macro prevents other threads from intruding into a critical
section that is currently in use by another thread.
3.  RapSim (RAPTOR Simulator) 
RapSim (RAPTOR Simulator) is an on-chip multiprocessor simulator developed by ETRI. It
was constructed from a basic uniprocessor model, which consists of an Instruction Set
Architecture (ISA) simulator and a microarchitecture simulator. The final multiprocessor
model was built by bringing together four of these uniprocessors.
The uniprocessor model can be divided into two parts: the front-end and the back-end.
The front-end is the ISA simulator, which executes benchmarks and produces an
instruction trace at every execution cycle. The back-end generates statistical data
from the trace of the front-end. These statistical data include the execution cycle
count and other performance evaluation results.
This chapter describes the RAPTOR processor model and the RapSim architecture. Much of
this chapter is borrowed from ETRI's report [6] because this part was implemented and
simulated by ETRI.
3.1  RAPTOR Architecture 
RAPTOR is an on-chip multiprocessor that consists of several main parts: the General
Processor Unit (GPU), Graphic Co-processor Unit (GCU), Inter-processor Bus Unit (IBU),
External Cache control Unit (ECU), Multiprocessor Control Unit (MCU) and Port Interface
Unit (PIU). The GPU is composed of four independent processor cores. RAPTOR has one
graphic co-processor (GCU). The IBU is a shared bus that connects the GPUs and the ECU.
The MCU distributes external interrupts equally over all GPUs and provides the GPUs
with synchronization resources. The PIU is a multiprocessor-ready bus interface for
communicating with the outside of RAPTOR. The four GPUs execute all instructions except
the extended graphic instructions. Each GPU maintains its own register file and program
counter. However, the ECU is shared among the GPUs through the IBU. The GCU is also
shared among all GPUs and executes all the extended graphic instructions with SIMD
(Single Instruction Stream Multiple Data Stream) type pixel processing hardware.
Figure 3.1 shows the block diagram of RAPTOR.  Main architectural features of RAPTOR 
are shown below: 
• 	 Single chip 4-way multiprocessor sharing off-chip 2nd level cache
• 	 64-bit data and 64-bit virtual address
• 	 SPARC V9 ISA
• 	 Extension of graphic instruction set
• 	 Multiple cache structure consisting of on-chip 1st level cache and off-chip 2nd
level cache
• 	 Harvard structure of 1st level cache of 16 Kbyte instruction cache and 16 Kbyte
of data cache
• 	 On-chip 2nd level cache controller handling 4 Mbyte of unified off-chip 2nd
level cache
ECU: External Cache Control Unit
GCU: Graphic Co-processor Unit
GPU: General Processor Unit
IBU: Inter-processor Bus Unit
MCU: Multiprocessor Control Unit
PIU: Port Interface Unit
Figure 3.1: Block Diagram of RAPTOR Microprocessor [6]
3.1.1  GPU 
The GPU is a simple 2-way superscalar RISC core that executes the SPARC V9 instruction
set with branch folding [7]. Instructions are prefetched from the Instruction Cache
(I-Cache) and stored into the Instruction Buffer (I-Buffer). To reduce the overhead
caused by branch operations, the branch folding technique is used during the
instruction prefetch stage. Two instructions in the I-Buffer are decoded and issued to
the proper functional units every cycle. The Reorder Buffer (ROB) allocates entries for
the issued instructions to support out-of-order execution. The Reservation Stations
(RSs) of each functional unit resolve the dependency problems among instructions. After
the execution of each instruction, the functional units return their results to the
Reorder Buffer through the Result Bus. The Reorder Buffer checks the status of each
entry, updates the register file, and deallocates the entries of the committed
instructions.
The register file of the Integer Execution Unit (IEU) is organized as 8 register
windows, where each window has 32 entries. The floating-point registers handle single,
double and quad precision floating-point data. The register file of the Floating Point
Unit (FPU) can store either 32 entries of single or double precision data, or 16
entries of quad precision data. The graphic register file exists to support the
extended graphic instructions. It can store 32 32-bit graphic data values.
3.1.2  GCU
The instruction set supported by the GCU consists of key instructions that are widely
used in multimedia and signal processing algorithms. The GCU architecture follows the
SIMD design philosophy. The functional units of the GCU can execute 8-bit, 16-bit and
32-bit packed arithmetic operations, Boolean operations and bit-wise manipulations.
Moreover, the GCU calculates the sum of absolute pixel distances, which allows MPEG
algorithms to be handled more efficiently.
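To make the sum of absolute pixel distances concrete, the following scalar C sketch computes it for two blocks of 8-bit pixels. It only illustrates the arithmetic; the function name sad_u8 is hypothetical, and the GCU performs this operation in packed SIMD hardware rather than in a loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of 8-bit pixels.
 * Scalar illustration only; sad_u8 is a hypothetical name and does not
 * correspond to an actual GCU instruction. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* |a[i] - b[i]| computed without signed overflow */
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    }
    return sum;
}
```

In MPEG motion estimation, a value of this kind is evaluated for many candidate block positions, and the position with the smallest sum is taken as the best match.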
3.1.3  MCU
To provide the SMP feature to RAPTOR, all GPUs should have equal chances of handling
external interrupts. Otherwise, software would have to identify which GPU is
responsible for handling a particular interrupt. The purpose of the MCU is to provide
this SMP feature. The MCU deals with two kinds of external interrupts. The first is a
direct interrupt, which must be handled by a specific GPU. The other is an arbitrary
interrupt, which can be handled by any of the four GPUs. The MCU gathers the arbitrary
interrupts and distributes them equally to all GPUs. When the GPUs work together to
execute very tightly coupled, multithreaded tasks, the MCU provides them with a
message-passing mechanism for inter-GPU communication.
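The two interrupt classes can be sketched in C as follows. The source does not state how the MCU balances arbitrary interrupts, so the round-robin policy here is an assumption, and all names (mcu_sketch_t, mcu_route, ANY_GPU) are hypothetical.

```c
#define NUM_GPUS 4
#define ANY_GPU  (-1)   /* marks an arbitrary interrupt (hypothetical encoding) */

typedef struct {
    int      next;                 /* next GPU to receive an arbitrary interrupt */
    unsigned delivered[NUM_GPUS];  /* interrupts delivered to each GPU */
} mcu_sketch_t;

/* Returns the GPU chosen to handle one external interrupt.
 * A valid target (0..3) models a direct interrupt; ANY_GPU models an
 * arbitrary interrupt, distributed round-robin (an assumed policy). */
int mcu_route(mcu_sketch_t *m, int target)
{
    int gpu;
    if (target >= 0 && target < NUM_GPUS) {
        gpu = target;                         /* direct interrupt */
    } else {
        gpu = m->next;                        /* arbitrary interrupt */
        m->next = (m->next + 1) % NUM_GPUS;
    }
    m->delivered[gpu]++;
    return gpu;
}
```

With this policy, a long run of arbitrary interrupts is spread evenly over the four GPUs, while direct interrupts do not disturb the rotation.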
3.1.4  IBU, ECU and PIU 
The IBU is a shared bus that connects the GPUs and the ECU. Each GPU accesses the ECU
through the IBU when it requires a memory access to service an internal cache miss or a
write-through.
The ECU receives requests through the IBU and returns the proper response, employing a
modified MESI cache coherency protocol. It is required to keep the caches coherent
among ECUs through the PIU.
The PIU is the gateway for accessing RAPTOR from outside. It provides a
multiprocessor-ready shared bus interface with a snooping cache coherency protocol. The
PIU also provides an SRAM module interface for accessing the off-chip 2nd level cache
data RAMs.
3.2  RapSim Features 
To evaluate several tradeoffs in designing RAPTOR, ETRI developed a simulator called
RapSim. RapSim is a program-driven microarchitecture simulator that contains four GPUs
and a memory hierarchy shared by the four GPUs. Each GPU model consists of a
Pre-Processing Unit, which acts as an instruction set simulator, and a Post-Processing
Unit, which acts as a performance simulator.
Main features of RapSim are as follows: 
• 	 Execution of SPARC V9 instructions and graphic co-processor instructions
• 	 Program-driven simulator having timing information
• 	 Multiprocessor model consisting of four processor cores and one graphic
co-processor
• 	 Support for out-of-order execution of a processor core
• 	 Support for SMT programming model
• 	 Information gathering for performance evaluation
The Pre-Processing Unit, shown in Figure 3.2, is an instruction set simulator that
contains a processor model for executing instructions, data structures for the register
files, a proxy model for handling system calls, and a 1st level cache. The
Pre-Processing Unit fetches instructions and data from a shared memory hierarchy that
includes a 2nd level cache model, executes the instructions and generates an on-the-fly
trace consumed by the Post-Processing Unit.
The Pre-Processing Unit starts the simulation by loading the compiled binary benchmark,
which is statically linked to MMOS. When the benchmark is loaded, a specific starting
program counter is assigned to the processor model, and the trap table and trap
handlers are initialized in the memory model. The stack area is also constructed in the
memory model. Then, the processor starts its execution using the internal resources:
the execution units, register files and 1st level cache model.
[Figure 3.2 block diagram. Labels in the Pre-Processing Unit part: Binary Loader &
Initialization, Main Memory, 2nd Level Cache, System Call Proxy, Trap Handler Init.,
Execution Units, Processor Model, 1st Level Cache, ISA decoder. An on-the-fly trace
feeds the Post-Processing Unit, whose labels are: Fetcher, Decoder, Issue, Executer,
Write-back & commit, Reservation Stations, Reorder Buffer, Cycle Calculator.]
Figure 3.2: Block Diagram of RapSim [6]
When the Pre-Processing Unit executes its instruction stream, it generates an
on-the-fly trace, which is the sequence of executed instructions. Each entry of the
on-the-fly trace contains enough information for the Post-Processing Unit to carry out
the performance simulation using the trace as its input.
The Post-Processing Unit is a RISC pipeline model that carries out the performance
simulation using the instruction traces generated by the Pre-Processing Unit. It is
modeled after a 2-way superscalar processor, including Reservation Stations and a
Reorder Buffer to support out-of-order execution.
Two instructions in the Trace Buffer are fetched and pre-decoded in a cycle. The
pre-decoded instructions in the Instruction Buffer are decoded and issued to the proper
Reservation Station. At the same time, the Reorder Buffer is updated. Each Execution
Unit (EU) executes the instructions from its Reservation Station whose dependencies
have been resolved. The execution results of the Execution Units are reflected in the
Reorder Buffer, and the completed entries of the Reorder Buffer are written into the
register files.
The architectural parameters used in the simulation model are shown in Table 3.1.
Feature                          Default Value
Number of GPUs (P)               4 (1, 2, 4)
GPU issue width                  2
1st level cache size             16 Kbyte I-cache, 16 Kbyte D-cache / 32 bytes line
2nd level cache size             4 Mbyte / 32 bytes line
Write update policy              1st level cache to 2nd level cache: write through
                                 2nd level cache to main memory: write back
1st level cache access latency   1 cycle
2nd level cache access latency   4 cycles
Main memory access latency       10 cycles
Instruction execution latency    Integer ALU = 1 cycle
                                 Integer Multiply = 4 - 34 cycles
                                 Integer Division = 36 (single), 68 (double) cycles
                                 Load/Store = 1 cycle
                                 Control Transfer = 1 cycle
                                 FP Addition/Subtraction = 1 cycle
                                 FP Multiply = 4 cycles
                                 FP Division = 12 (single), 22 (double) cycles
Table 3.1: Architectural Configuration [6]
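As a rough illustration of how the cache latencies in Table 3.1 combine, the sketch below estimates the average number of cycles per memory access. It assumes that the latencies add up along the miss path, which the source does not state; the miss rates are free parameters rather than measured values, and the function name is hypothetical.

```c
/* Access latencies in cycles, taken from Table 3.1. */
#define L1_LATENCY   1.0
#define L2_LATENCY   4.0
#define MEM_LATENCY 10.0

/* Average cycles per memory access under a simple additive model
 * (an assumption): every access pays the 1st level latency, misses
 * additionally pay the 2nd level latency, and 2nd level misses
 * additionally pay the main memory latency. */
double average_access_cycles(double l1_miss_rate, double l2_miss_rate)
{
    return L1_LATENCY
         + l1_miss_rate * (L2_LATENCY + l2_miss_rate * MEM_LATENCY);
}
```

For instance, if half of all accesses missed in the 1st level cache and none missed in the 2nd level cache, this model would give 1 + 0.5 x 4 = 3 cycles per access.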
4.  MMOS (Multithreaded Mini Operating System) 
This section describes the development of the Multithreaded Mini-OS (MMOS) for ETRI's
RapSim project. MMOS was developed to provide ETRI with a software environment for the
4-way UltraSPARC V9 chip multiprocessor simulator. The possibility of porting an entire
multithreaded OS kernel to the simulator was considered by ETRI and then rejected as
cost-prohibitive and time-consuming. At the request of ETRI, we suggested the
alternative of using a modified Pthreads package, which would become MMOS. MMOS is a
simplified OS that assigns and schedules multiple threads on the 4-way multiprocessor
RapSim until the threads complete their execution. The MMOS group was formed and
branched into three parallel development efforts: the SMP (Symmetric Multiprocessor)
Safe Pthreads Library, the SMP Safe C Library, and the Interface Support between MMOS
and RapSim.
Figure 4.1 shows the general structure of the RapSim and MMOS integration model. The
application programs used are from the SPLASH suite of multiprocessor benchmarks. The
benchmarks are linked to the SMP Safe Pthreads Library, C library, and Math library to
run on RapSim. The SMP Safe Pthreads library provides synchronization and scheduling
among multiple threads, which are enqueued and dequeued from an externally defined
queue structure in any order. The final component of MMOS is the Interface Support,
which connects MMOS to the 4-context RapSim, which simulates a quad-processor chip
multiprocessor. The Interface Support's primary responsibility is to schedule and
assign threads to the hardware contexts of RapSim.
[Figure 4.1 diagram. MMOS: linked libraries (SMP Safe Pthread Library, C Library, Math
Library), the User Application Program, and the Interface Support (Thread Scheduler,
Timer Interrupt, Keyboard Interrupt, OS Services). RapSim: the 4-way RAPTOR Simulator
with a Context (PC, Register File, etc.) and Memory (stack, data, text, etc.). Host:
Sun OS UltraSPARC Workstation.]
Figure 4.1: Organization of MMOS and RapSim Integration
4.1 Porting Pthreads to Sun UltraSPARC 
The first effort was porting MIT Pthreads to the SPARC V9 processor. It was discovered
very early on that the Pthreads core, written by Chris Provenzano, was not
fundamentally SMP safe. That is, there was little, if any, support for atomic
operations on an SMP.
The initial version of the SMP safe Pthreads library was produced by Mark N. Dailey,
who worked on the Linux version and put a great deal of effort into modifying the
original Pthreads to be SMP safe. The changes made to the original Pthreads were
discussed in Chapter 2. However, these initial efforts targeted a Linux-based Intel SMP
machine. Therefore, this Pthreads package had to be ported to the Solaris SPARC
environment in which the final version of RapSim runs. In the process of porting to a
SPARC machine, numerous modifications were required to make the benchmarks run properly
on the native SPARC processor. At this point, all known bugs have been fixed and all
required functions and header files have been added.
4.1.1 Machine Dependent Header Files 
The initial version of MMOS was developed on an Intel-based SMP running Linux (version
2.0.33), which allowed the MMOS group to test it on an actual SMP machine, a dual
processor. Afterwards, MMOS was again extensively modified so that it could run on the
SPARC V9 on-chip multiprocessor simulator RapSim.
The Linux version of the Pthreads package naturally had links to several Linux machine
dependent header files. Therefore, most of the machine dependent header files were
retargeted to the Solaris SPARC versions already available with the original Pthreads
package. The SPARC version of the SMP Safe Pthreads package should include SPARC system
header files instead of Linux header files wherever possible in order to run on a SPARC
processor. However, some of the Linux header files, which are tightly coupled to the
original Pthreads, had to be included in the new header file directory to run Pthreads
on a SPARC machine. Linking to the Solaris system header files resulted in many
problems, which were caused by collisions between the Linux and Solaris header files.
Eventually, all of the required Solaris system header files were copied into the SPARC
version of the Pthreads package. This independent set of header files replaced the
links to the existing system header files.
4.1.2 Machine Dependent Codes 
The Intel and SPARC architectures have very different instruction sets. It is difficult
to run the Linux version of the Pthreads package on a SPARC machine because the
original Pthreads contains some inline machine dependent code. This machine dependent
code had to be recreated or replaced by SPARC assembly code.
The following details the major modifications made thus far: 
For many years, developers have worked to make Solaris the most scalable shared 
memory multiprocessor operating system.  For many  applications, contention for locks 
inside the kernel is not an issue; however, heavy workloads and new kinds of
applications can still cause problems. It is possible to monitor lock contention and
determine whether or not an application is causing contention. The Unix kernel has
many  critical regions,  or sections  of code,  where  a data  structure  is  being created or 
updated.  These regions  must not  be  interrupted by a higher-priority  interrupt service 
routine.  A uniprocessor Unix kernel manages these regions by setting the interrupt mask 
to  a  high  value  while  executing  in  the  region.  On  a  multiprocessor,  there  are  other 
processors with their own interrupt masks, therefore a different technique must be used to 
manage critical regions [8]. 
One key function in shared-memory multiprocessor systems is the ability to perform
interprocessor synchronization by means of atomic load/store or swap instructions. To
protect the internal data structures that are used by Pthreads itself, the concept of
an atomic lock is implemented by the test_and_set macro. The i386 provides a way to
perform an atomic test_and_set using the xchg (Exchange) instruction. The original
test_and_set macro was coded in the Linux version of the Pthreads package using Intel
inline assembly.
xchg swaps  the  contents  of two  operands  and  takes  the  place  of three  mov 
instructions.  It does not require a temporary location to save the contents of one operand 
while loading the other.  xchg is  useful for  implementing semaphores or similar data 
structures for process synchronization [9]. 
The test_and_set function that uses the xchg instruction is shown below: 
- Linux Version: 
static inline int test_and_set(semaphore *lock)
{
    int temp;

    // use atomic exchange operation in i386 mode
    __asm__ volatile (
        "xchgl %0, (%2)"
        : "=r" (temp)
        : "0" (1), "r" (lock)
    );
    return temp;
}
When a thread tries to enter the critical section, it calls the test_and_set function
to acquire the lock. In the given code, the value currently held in the memory location
"lock" is read and saved in "temp". Whenever the test_and_set function is called, the
constant value "1" is stored in the memory location "lock". xchgl (the 32-bit xchg
instruction) copies the value in the memory location addressed by %2 to %0, and the
content of that location is replaced by the value "1". If a thread cannot acquire the
lock, it busy waits until another thread releases it. This is shown below:
static inline void pthread_aquire(semaphore *lock)
{
    while (test_and_set(lock) != 0)
        ;   // busy wait until the lock is released
}

// SEMAPHORE_CLEAR is defined '0'
#define pthread_release(lock)  (*(lock) = SEMAPHORE_CLEAR)
All SPARC processors have an instruction called ldstub, which means
load-store-unsigned-byte. The ldstub instruction reads a byte from memory into a
register, then writes 0xff into memory in a single, indivisible operation. The value in
the register can then be examined to see if it was already 0xff (which means that
another processor got there first), or if it was 0x00 (which means that this processor
is in charge). This instruction is used to make mutual exclusion locks that make sure
only one processor at a time can hold the lock [8].
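The ldstub semantics described above can be imitated portably with C11 atomics, which may help when reading the SPARC assembly. This is a minimal sketch and not the MMOS code: it swaps in atomic_exchange for the hardware instruction, and the names byte_lock_t, ldstub_emulated, byte_lock_try, byte_lock_acquire and byte_lock_release are all hypothetical.

```c
#include <stdatomic.h>

/* One lock byte: 0x00 means free, 0xff means held. */
typedef atomic_uchar byte_lock_t;

/* Emulates ldstub: atomically returns the old byte and stores 0xff. */
static inline unsigned char ldstub_emulated(byte_lock_t *p)
{
    return atomic_exchange(p, (unsigned char)0xff);
}

/* Single attempt: nonzero if this caller obtained the lock. */
static inline int byte_lock_try(byte_lock_t *p)
{
    return ldstub_emulated(p) == 0x00;
}

/* Busy waits until the old value was 0x00, i.e. this caller got there first. */
static inline void byte_lock_acquire(byte_lock_t *p)
{
    while (ldstub_emulated(p) != 0x00)
        ;
}

/* Releases the lock by storing 0x00, as the stb-based reset does. */
static inline void byte_lock_release(byte_lock_t *p)
{
    atomic_store(p, (unsigned char)0x00);
}
```

A critical section would be bracketed by byte_lock_acquire and byte_lock_release on the same lock byte. A caller that sees 0xff returned knows that another processor already holds the lock.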
Intel's test_and_set macro had to be replaced by the SPARC inline assembly macro
SEMAPHORE_TEST_AND_SET. The SPARC versions of the inline assembly Test_and_Set and
Reset are shown below:
- SPARC Version:
static inline int SEMAPHORE_TEST_AND_SET(semaphore *lock)
{
    char *p = (char *) lock;
    long temp;

    __asm__ volatile("ldstub %1, %0"
        : "=r" (temp)
        : "m" (*p)
        : "memory"
    );
    return temp;
}

static inline void SEMAPHORE_RESET(semaphore *lock)
{
    char *p = (char *) lock;

    // SEMAPHORE_CLEAR is defined '0'
    __asm__ volatile("stb %1, %0"
        : "=m" (*p)
        : "r" (SEMAPHORE_CLEAR)
        : "memory"
    );
}
The store integer instruction copies a word, the least significant halfword, or the
least significant byte from its second argument (which is all zeros) into the location
specified by its first argument, which is the memory address of the lock. stb stores
the least significant byte of the second argument into memory.
B.  Introduction of FP Registers Save & Restore: 
The Intel Architecture Floating-Point Unit (FPU) provides instructions to support
context switching. The FSAVE (Store FP state) and FRSTOR (Restore FP state)
instructions are used to save and restore the FP state when the operating system needs
to perform a context switch.
The FSAVE instruction writes the current FPU state to the specified destination, then
reinitializes the FPU. This is typically used when the operating system needs to
perform a context switch, an exception handler needs to use the FPU, or an application
program wants to "clean" the FPU before a subroutine uses it. FRSTOR reloads the FPU
state from the memory area defined by the source operand. This data should have been
written by a previous FSAVE instruction [9].
Figure 4.2 illustrates an example of an operating system saving the floating-point
state. The operating system maintains a save area for each thread. There is a variable
that indicates which thread "owns" the FP state. On a task switch, the OS sets CR0.TS
(CR0: control register 0, TS: task-switched bit) to 1 if the incoming task does not own
the FP state. Otherwise, it sets it to 0. If a thread that is not the owner attempts to
execute an FP instruction while CR0.TS is set, exception Int 7 (Interrupt 7) is
generated. The Int 7 handler saves the FP state to the save area of the FP state owner
(Thread A) and restores the FP state from the save area of the current thread
(Thread B). Then, the ownership of the FP state changes to the current thread and
CR0.TS is reset to zero.
[Figure 4.2 diagram. The operating system keeps an "FP state owner" variable, FP state
Save Area A and FP state Save Area B for the application threads Thread A (FP state
owner) and Thread B.
Thread switch code: if (incoming thread != "FP state owner") CR0.TS = 1; else
CR0.TS = 0.
INT 7 handler: if (CR0.TS = 1 and FP instruction) FSAVE to the "FP state owner" thread
area; FRSTOR from the current thread area; CR0.TS = 0; "FP state owner" = current
thread.]
Figure 4.2: Example of FP State Saving [10]
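The ownership scheme of Figure 4.2 can be modeled in plain C to show why the FP state is only copied when a non-owner thread actually uses the FPU. This is a software simulation under simplifying assumptions (two threads, eight registers, a flag standing in for CR0.TS); none of the names below come from the source.

```c
#define SIM_THREADS 2
#define SIM_FP_REGS 8

typedef struct { double reg[SIM_FP_REGS]; } fp_state_t;

typedef struct {
    fp_state_t hw;                      /* simulated FPU registers */
    fp_state_t save_area[SIM_THREADS];  /* one save area per thread */
    int owner;                          /* thread that owns the FP state */
    int current;                        /* thread currently running */
    int ts;                             /* stands in for CR0.TS */
    int swaps;                          /* save/restore pairs performed */
} fpu_sim_t;

/* Thread switch code: only mark the FP state as stale, copy nothing. */
void sim_thread_switch(fpu_sim_t *s, int incoming)
{
    s->current = incoming;
    s->ts = (incoming != s->owner) ? 1 : 0;
}

/* Called before each simulated FP instruction; plays the INT 7 handler. */
void sim_fp_instruction(fpu_sim_t *s)
{
    if (s->ts) {
        s->save_area[s->owner] = s->hw;    /* FSAVE to the owner's area */
        s->hw = s->save_area[s->current];  /* FRSTOR from the current area */
        s->owner = s->current;
        s->ts = 0;
        s->swaps++;
    }
    /* the FP instruction itself would execute here */
}
```

Switching to a thread and back again without any FP instruction in between leaves swaps unchanged, which is the saving that the deferred scheme aims for.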
The Linux version of the Pthreads package saves the FP registers with a single inline
assembly instruction:
__asm__ ("fsave %0" : : "m" (*fdata));
// fdata: memory location for FP register save
The FP registers are restored with:
__asm__ ("frstor %0" : : "m" (*fdata));
Each thread has a memory area, defined in the machine dependent Pthreads header file,
to store the FP state registers.
A SPARC processor includes two types of general registers. The IU's (Integer Unit)
general-purpose registers are called r registers, and the FPU's (Floating Point Unit)
general-purpose registers are called f registers. The FPU contains 32 32-bit
floating-point f registers, which are numbered from f[0] to f[31]. Unlike the windowed
r registers, an instruction has access to any of the 32 f registers at a given time.
The f registers can be read and written by FPop instructions, and by load/store
single/double floating-point instructions (LDF, LDDF, STF, STDF) [11]. Figure 4.3 shows
the f registers in SPARC.
[Figure 4.3 diagram: the f registers f[31], f[30], ..., f[1], f[0], each spanning bits
31 to 0.]
Figure 4.3: The f Registers [11]
The SPARC architecture does not have any instructions to save and restore the entire FP
state. An alternative way to flush the FP state to memory is required to guarantee the
correct operation of each thread. New procedure calls were created to replace the FSAVE
and FRSTOR instructions of the Intel architecture. The contents of each of the 32 f
registers had to be saved to designated memory locations by the programmer.
Figure 4.4 shows the process of saving the FP registers to memory on SPARC. The
register windows in SPARC partially overlap; thus the out registers of the caller are
renamed to become the in registers of the called procedure.
[Figure 4.4 diagram: the FPU registers FSR and f[0] through f[31] are stored to the
memory save area that starts at fdata, with FSR at offset 0, f[0] at +4, f[1] at +8,
and so on up to f[30] at +124 and f[31] at +128. The code shown in the figure is:]
    save    %sp, -120, %sp
    mov     %i0, %l5
    st      %fsr, [%l5+0]
    st      %f0, [%l5+4]
    st      %f1, [%l5+8]
    ...
    st      %f30, [%l5+124]
    st      %f31, [%l5+128]
    ret
    restore
Figure 4.4: FP Registers Saving
The incoming parameter 0, which is the outgoing parameter 0 (%o0) of the caller, is
contained in in register 0 (%i0). The FSR (Floating-Point State Register) fields
contain FPU mode and status information. All FP registers, including the FSR, are saved
one by one in the save area of the owner thread. The restoring procedure call works in
the same way, but moves the data in the opposite direction.
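The layout of the save area implied by Figure 4.4 can be written down as a C structure, which makes the offsets easy to check. This is a sketch only: the type and function names are hypothetical, and the actual save routine is the SPARC assembly shown in the figure.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-thread FP save area as laid out in Figure 4.4:
 * FSR at offset 0, then f[0] at +4 up to f[31] at +128. */
typedef struct {
    uint32_t fsr;     /* stored by: st %fsr, [%l5+0] */
    uint32_t f[32];   /* stored by: st %fN,  [%l5+4*(N+1)] */
} fp_save_area_t;

/* Byte offset of f register n within the save area. */
size_t fp_reg_offset(unsigned n)
{
    return offsetof(fp_save_area_t, f) + n * sizeof(uint32_t);
}

/* Plain C stand-in for the save routine: copies the FSR and the 32
 * register values, passed in as raw 32-bit words, into the save area. */
void fp_save_sketch(fp_save_area_t *fdata, uint32_t fsr, const uint32_t f[32])
{
    fdata->fsr = fsr;
    for (unsigned n = 0; n < 32; n++)
        fdata->f[n] = f[n];
}
```

With this layout, fp_reg_offset(0) is 4 and fp_reg_offset(31) is 128, matching the store offsets in the figure, and the whole area occupies 132 bytes.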
C.  Jmpbuf Structure: 
This was one of the most difficult parts of porting MMOS (and thus the Pthreads
package) to the SPARC machine, since the SPARC architecture uses a different type of
Jmpbuf from the Intel architecture. This is probably the most critical part of SMP safe
Pthreads because setjmp and longjmp calls are used to context switch among multiple
threads. That is, the Jmpbuf is used to save a thread's state (setjmp) when the thread
is enqueued on the scheduling priority queue. Conversely, it is used to restore
(longjmp) the previously saved state when a thread is dequeued from the scheduling
priority queue for execution. We assume a mechanism such as Unix setjmp/longjmp, where
an array (called a jmp_buf) holds the volatile state associated with an executing
thread.
The C  standard provides two functions, setjmp and longjmp, which can be 
used  to  perform non-local  gotos (A goto is  non-local  if it  branches  outside  of its 
function.  This is  not possible with the C goto statement).  They have the following 
prototypes: 
#include <setjmp.h>

int setjmp(jmp_buf env);

void longjmp(jmp_buf env, int val);
The call
setjmp(env);
saves a portion of the process environment in env, an object of type jmp_buf. This type
is defined in <setjmp.h>, a header required by the C standard.
The call
longjmp(env, val);
resumes execution from the point of the corresponding setjmp (in this case,
corresponding means the call to setjmp that stored an environment in the variable
passed to longjmp). The function from which setjmp was called must not have returned
before the corresponding longjmp is called. Values of global variables are as they were
at the time of the call to longjmp. Values of local variables that are still in scope
and have changed between the call to setjmp and longjmp are indeterminate. The idea is
that setjmp saves information in the env argument, which keeps enough status for the
program to resume execution at the point where the contents of the jmp_buf were saved.
This presumably includes the values of important hardware registers such as a stack
pointer or a frame pointer.
A direct call to setjmp always returns zero. A call to longjmp does not appear to
return. Instead, the program continues as if the call to the corresponding setjmp had
just returned, but with a nonzero value. The value is the val argument to longjmp,
except that if that value is zero it is treated as if it were "1" [12].
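The return-value rules just described can be observed with a few lines of standard C. The sketch below is independent of the Pthreads code; the names env, jump_back and demo_nonlocal_goto are chosen here for illustration.

```c
#include <setjmp.h>

static jmp_buf env;   /* environment saved by setjmp */

/* Jumps back to the setjmp call site; never returns to its caller. */
static void jump_back(int val)
{
    longjmp(env, val);
}

/* Reports what setjmp returned after the non-local jump:
 * 1 if it returned 1, 2 if it returned any other nonzero value. */
int demo_nonlocal_goto(int val)
{
    switch (setjmp(env)) {
    case 0:                 /* direct call: setjmp returned zero */
        jump_back(val);
        return -1;          /* not reached */
    case 1:                 /* longjmp with val == 1, or with val == 0 */
        return 1;
    default:                /* longjmp with any other nonzero val */
        return 2;
    }
}
```

Calling demo_nonlocal_goto(0) yields 1, which shows the substitution of 1 for a zero val. The setjmp call is placed directly in the switch condition because the C standard only permits it in a few expression contexts.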
The machine dependent Jmpbuf structure in the Intel architecture is defined in the
<jmp_buf.h> header file by
typedef struct __jmp_buf_base
{
    long int __bx, __si, __di;
    __ptr_t __bp;
    __ptr_t __sp;
    __ptr_t __pc;
#ifdef __SVR4_I386_ABI_L1__
    unsigned long pad1[4];
#endif
} __jmp_buf[1];
The thread's starting address has to be assigned to "__pc" and the proper stack pointer
to "__sp" at the time when Pthreads creates a thread. Each thread has its own stack
area in memory. This stack area should be protected from use by other threads. When a
context switch occurs from one thread to another, setjmp is called, and this call saves
the return PC (Program Counter) into "__pc" of the calling thread's jmp_buf structure.
The calling thread's SP (Stack Pointer) and FP (Frame Pointer) are also saved into
"__sp" and "__bp". The thread whose turn it is to execute calls longjmp, and this call
restores SP and FP to resume its execution. After restoring the execution environment,
the PC jumps to this thread's saved address, which was the last PC of this thread.
SPARC also has a system data structure to support setjmp and longjmp calls. The Jmpbuf
structure defined in <setjmp.h> on SPARC is
#define _JBLEN 12    /* SPARC ABI (Application Binary Interface) value */
typedef int jmp_buf[_JBLEN];
This is simple but different from Intel's jmp_buf. The jmp_buf has room for many
registers to save the current context. The length of jmp_buf is defined by the SPARC
ABI value, which is "12". However, the only registers that need to be saved are the
stack pointer (SP), the frame pointer (FP), and the program counter (PC).
We needed a different type of initialization for SPARC's jmp_buf. On the SPARC
architecture, a longjmp call is accompanied by a ta (Trap Always) software trap. This
trap flushes the currently active register windows to memory. After the proper stack
pointer is assigned to the jmp_buf of a thread, the ta instruction has to be used to
save the current register windows.
The most common and general way to synchronize among threads is to ensure that all
memory accesses to the same data are "mutually exclusive". That means that only one
thread is allowed to write at a time; others must wait for their turn. The word mutex
is a combination of "mut" from the word "mutual" and "ex" from the word "exclusion".
Synchronization is important not only when data is modified. It is also required when a
thread reads data that was written by another thread, if the order in which the data
was written is critical.
For example, a thread writes new data to an element in an array, and then updates the
maximum index to indicate that the array element is valid. Another thread, running
simultaneously on another processor, steps through the array performing some
computation on each valid element. If the second thread "sees" the new value of the
maximum index before it sees the new value of the array element, the computation would
be incorrect. This may seem irrational, but memory systems that work this way can be
substantially faster than memory systems that guarantee predictable ordering of memory
accesses. A mutex is one general solution to this sort of problem. If each thread locks
a mutex around the section of code that is using shared data, only one thread will be
able to enter the section at a time [8].
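The array example above can be sketched with the standard POSIX mutex calls. The variable and function names (items, max_index, publish, sum_valid) are hypothetical and are not taken from the benchmarks; the point is only that the writer and the reader take the same mutex around every access to the shared data.

```c
#include <pthread.h>

#define MAX_ITEMS 64

static int items[MAX_ITEMS];
static int max_index = 0;   /* number of valid elements in items[] */
static pthread_mutex_t items_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: stores the element, then marks it valid, all under the mutex.
 * Returns 0 on success and -1 if the array is full. */
int publish(int value)
{
    int rc = -1;
    pthread_mutex_lock(&items_lock);
    if (max_index < MAX_ITEMS) {
        items[max_index] = value;
        max_index++;
        rc = 0;
    }
    pthread_mutex_unlock(&items_lock);
    return rc;
}

/* Reader: steps through the valid elements under the same mutex, so it
 * can never observe a new max_index without the matching element. */
long sum_valid(void)
{
    long sum = 0;
    pthread_mutex_lock(&items_lock);
    for (int i = 0; i < max_index; i++)
        sum += items[i];
    pthread_mutex_unlock(&items_lock);
    return sum;
}
```

One thread may call publish while another calls sum_valid on a different processor. Because both paths lock items_lock, the reader sees either the state before an update or the state after it, never a mixture.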
[Figure 4.5 timing diagram: the timelines of Thread 1, Thread 2, and Thread 3 against
time, with the events "thread 1 locks", "thread 2 waits", and "thread 1 unlocks".]
Figure 4.5: Mutex Operation [4]
Figure 4.5 shows a timing diagram of three threads sharing a mutex.
pthread_mutex_last is an additional function that provides synchronization among
multiple threads at a "barrier".  A barrier is usually employed to ensure that all threads
cooperating in some parallel algorithm reach a specific point in that algorithm before any
can pass.  A barrier is initialized to stop a certain number of threads, i.e., when the
required number of threads have reached the barrier, all are allowed to continue [4].  This
function is similar to the function pthread_mutex_unlock except that
pthread_mutex_last is only used at the exit point of the barrier_wait function.
Basically, all barrier_wait functions in the benchmarks share the same
barrier_struct.  If this structure is not treated carefully and fails to release all the
shared lock variables before the first thread leaves the current barrier, the first thread can
reenter the barrier while other threads are still in it.  This will cause a deadlock
among threads at a certain point.  Figure 4.6 shows an example that explains this
situation.
The core function of pthread_mutex_last is to make earlier threads wait for
later threads that are still in other waiting queues in barrier_wait.  If all threads are
assembled in the priority queue after finishing the pthread_mutex_last function,
all locks are released and the threads will not cause deadlock.  Usually, multiple mutexes
are used to synchronize operation among threads.  Figure 4.6.a, which manages the barrier
without the pthread_mutex_last function, shows a case where deadlock cannot be
avoided.  When thread 1 exits barrier 1, the other threads (threads 2, 3, and 4) are still locked
in multiple mutexes.  Thread 1 will try to enter barrier 2, which uses the same barrier structure
as barrier 1.  At this time, thread 1 again tries to get the lock of the critical section.
However, this mutex is held by one of the three other threads that are waiting to take
their turn to run.  These threads cannot go on with their operation because thread 1
does not yield its turn.  Thread 1 also cannot move into barrier 2 because barrier 1
is still locked by one of the other threads in the first barrier.  As stated earlier, barriers 1
and 2 share the same barrier structure.  Thread 1 busy-waits until it acquires the lock for
the barrier structure.  Therefore, no thread will release its resources to the others before it
acquires the key resources from them.  The result is a deadlock
situation.
Figure 4.6.b uses the pthread_mutex_last function to avoid deadlock.  When
thread 1 arrives at the end point of barrier 1, it has to wait until the other threads
arrive at the end point of barrier 1.  After all threads arrive at the end point and make the
barrier structure available to any thread, thread 1 restarts its execution without any
problems.  When thread 1 is blocked at a certain point in barrier 2, it yields its turn to the
other threads.  In this way, the deadlock situation is avoided.
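The reuse hazard described above (a fast thread re-entering a barrier whose structure is
still in use by slower threads) can also be illustrated with a different, widely used
technique.  The sketch below is not pthread_mutex_last and is not taken from MMOS: it
swaps in a cycle counter and a condition variable, so that a thread arriving for the next
barrier cannot be confused with threads still leaving the previous one.  The type
demo_barrier_t and both function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical reusable barrier; all names are illustrative. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  all_here;
    int      threshold;   /* number of threads the barrier stops */
    int      waiting;     /* threads that have arrived in this cycle */
    unsigned cycle;       /* incremented each time the barrier opens */
} demo_barrier_t;

/* Returns 0 on success, -1 if the threshold is not positive. */
int demo_barrier_init(demo_barrier_t *b, int threshold)
{
    if (threshold < 1)
        return -1;
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->threshold = threshold;
    b->waiting = 0;
    b->cycle = 0;
    return 0;
}

/* Blocks until `threshold` threads have called it.
 * Returns 1 in the last thread to arrive and 0 in the others. */
int demo_barrier_wait(demo_barrier_t *b)
{
    int last = 0;
    pthread_mutex_lock(&b->mutex);
    unsigned my_cycle = b->cycle;        /* the cycle this thread belongs to */
    assert(b->waiting < b->threshold);
    if (++b->waiting == b->threshold) {
        b->waiting = 0;                  /* reset the count for the next use */
        b->cycle++;                      /* open the barrier for this cycle */
        pthread_cond_broadcast(&b->all_here);
        last = 1;
    } else {
        /* Wait until this cycle ends; a re-entering thread sees a new cycle. */
        while (my_cycle == b->cycle)
            pthread_cond_wait(&b->all_here, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
    return last;
}
```

Unlike the busy-waiting scheme in the text, waiting threads here sleep on the condition
variable, and the same structure can be shared by consecutive barriers because each thread
only compares against the cycle value it captured on entry.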
[Figure 4.6 placeholder: panel (a), without pthread_mutex_last, and panel (b), with
pthread_mutex_last; each panel shows threads 1 to 4, barrier 1, barrier 2, and the
priority queue.]
Figure 4.6: pthread_mutex_last Operation
4.2 Integration of MMOS and RapSim 
This subsection describes the integration of MMOS and RapSim.
MMOS and RapSim were developed separately by the OSU MMOS
group and ETRI.  When ETRI was ready to run MMOS on its RapSim processor
model, integration support between these two efforts had to be considered as the next task.
Most of the modifications for the integration had to be done by the MMOS group.
The benchmark programs ran without problems on a SUN/SPARC machine with the SPARC
version of the Pthreads package.  However, this Pthreads
package could still fail when run on a real multiprocessor model, and
it did not provide complete support for SMP.  When we
tested Pthreads on the 4-way RapSim processor model, we had to confront a number of
difficult problems.  To run the application programs on RapSim, many of the details had
to be changed or replaced.
4.2.1  SMP Initialization 
The Intel-based smp_init function initialized processors by calling the clone()
system call.  clone() was called from clone.S, which was written in Intel
assembly language.  Therefore, we needed to find a way to replace this assembly file
with a SPARC equivalent.
ETRI's initial version of RapSim was designed so that all four processors could be
initialized at the same time, with each processor's PC counting up from the same
base PC value.  However, in the context of RapSim, it is not correct for all processors to
start their execution with the same PC.  This is not the same case as running multiple
programs on multiple processors that keep their own memory and do not interfere with other
processors' memory.  Because RapSim was intended to be a shared-memory system,
execution would fail partway through if all the processors started with the same
PC.  Therefore, we had to find a way to share values between MMOS and
RapSim.  These values are predefined PC locations, which MMOS provides at the
special processor initialization point.  At this point, RapSim reads the starting
PC for each processor.
We first considered converting clone.S to a SPARC version.  However, we
realized that it would be time consuming and decided instead to give an initialization point in
memory for each processor and then cause an interrupt at the point when the rest of the
processors should be initialized.  Initially, the main processor runs the main thread.
The rest of the processors can be initialized to run threads from a priority queue when the
Pthreads scheduler is ready to schedule threads on processors.  For this operation, there
should be some special registers or memory locations that are designated to keep shared
information for both MMOS and RapSim.
Figure 4.7 shows how the SPARC version of smp_init saves initialization pointers
in memory.  Each processor has a stack area of size "0x10000".  Pthreads sets the
stack pointer of each processor to a specified location in memory.  When RapSim
performs the initialization of each processor, it reads the content of this
memory location, which was initially set by MMOS.
[Figure 4.7 placeholder: memory map with locations 0xffff0010, 0xffff0020, 0xffff0030,
0xffff0040, and 0xffff0100, labeled with the stack pointers for processors 1, 2, and 3,
the pointer to pthread_init, and the 0(1) flag.]
Figure 4.7: Assignment of Initialization Pointers
"0xffff0010", "0xffff0020", and "0xffff0030" are reserved locations that are used
to save the stack pointers of processors 1, 2, and 3.  The starting address of the
pthread_init function also has to be saved so that it can be used as
the starting PC for all processors.  The pthread_init function contains an
initialization procedure that runs whenever a processor starts its operation.  The
location "0xffff0100" holds a flag that indicates whether the remaining processors
(processor 1, processor 2, and processor 3) can be initialized.  Initially,
RapSim fills this location with "0" so that only one processor is active.  Each
processor is given execution cycles in turn.  At this time, the processor checks whether the
"flag" location is set to "1".  The processor is not allowed to execute instructions until the
main processor sets the value to "1".  When the value is set to "1", all the
remaining processors can be initialized to run threads.  There are no threads in the
priority queue initially when the main processor starts to execute.  Therefore, all
processors except the main processor run idle threads.  An idle thread does
nothing but check whether any executable threads are in the priority queue.
After the main processor creates executable threads and puts them in the priority
queue, Pthreads assigns a thread from the priority queue to a processor.  The RapSim
execution flow is shown in Figure 4.8.
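The handshake through the reserved memory locations can be sketched as below.  This is a
hedged model, not the MMOS or RapSim source: a small array stands in for the reserved
region, the function names smp_init_sketch and rapsim_try_start are hypothetical, and the
way stack tops are derived from a single stack_top argument is an assumption.  The slot
that holds the pointer to pthread_init is omitted.  Only the three stack-pointer locations,
the flag location, and the 0x10000 stack size come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define MAILBOX_BASE 0xffff0000u
#define FLAG_ADDR    0xffff0100u            /* "may start" flag from the text */
#define STACK_SIZE   0x10000u               /* per-processor stack size */
#define SP_ADDR(cpu) (MAILBOX_BASE + 0x10u * (uint32_t)(cpu))  /* cpu 1..3 */

/* Stand-in for the reserved memory region 0xffff0000..0xffff0103. */
static uint32_t mailbox[0x104 / 4];

/* Translate a reserved address into a pointer into the stand-in array. */
static uint32_t *slot(uint32_t addr)
{
    assert(addr >= MAILBOX_BASE && addr % 4 == 0);
    assert(addr - MAILBOX_BASE < sizeof mailbox);
    return &mailbox[(addr - MAILBOX_BASE) / 4];
}

/* MMOS side: record a stack pointer for processors 1..3, then raise the flag.
 * The stack layout below stack_top is an assumption made for this sketch. */
void smp_init_sketch(uint32_t stack_top)
{
    for (int cpu = 1; cpu <= 3; cpu++)
        *slot(SP_ADDR(cpu)) = stack_top - (uint32_t)cpu * STACK_SIZE;
    *slot(FLAG_ADDR) = 1;                   /* written last: releases the others */
}

/* Simulator side: called each time a waiting processor is given cycles.
 * Returns 1 and fills *sp_out once the flag is set, otherwise returns 0. */
int rapsim_try_start(int cpu, uint32_t *sp_out)
{
    assert(cpu >= 1 && cpu <= 3);
    if (*slot(FLAG_ADDR) != 1)
        return 0;
    *sp_out = *slot(SP_ADDR(cpu));
    return 1;
}
```

Writing the flag after the stack pointers mirrors the ordering concern from Section 4.1:
a processor that sees the flag must also see valid initialization data.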
4.2.2  machdep_cpuid 
As stated earlier, the Intel-based Linux machine used is an actual dual-processor
machine.  To identify which thread is running on a particular processor, we need
to specify the thread ID and the CPU ID whenever a thread restarts its execution after a
context switch.  The set_pthread_self function in the Pthreads package is used
to specify the thread ID and the processor ID.  The pthread_self function is
used to acquire the identification of a thread.
Pthreads has a special array structure in its header file  to keep the list of threads 
that are created to run the application program.  Each element of this array structure holds 
the specific thread's information that is used for the purpose of thread handling in many 
Pthreads functions.  The structure has the following form:
    typedef struct {
        pthread  self;
        int      cpuid;
        int      state;
    } pthread_self_t;
self is defined as the pthread descriptor and it contains all the information to
be used by a thread during program execution.  cpuid gives the caller the ID of the
processor on which the current thread is running.  The processor ID is used to acquire the
correct thread ID from the pthread array structure.  state contains the current state
information of a thread.  If the state is "PS_RUNNING", it indicates that the present
thread is a runnable thread.  In the other
states, the present thread cannot be run.  This means that it is still in one of the other
waiting queues or it is just an idle thread that does not contain executable code.
[Figure 4.8 placeholder: Pthreads execution flow across processors 1 to 4 and the priority
queue.  Initially only processor 1 runs the main thread and the others are not initialized;
after the call to 'smp_init' the others are initialized and run idle threads; after the
calls to 'pthread_create', processors 2 to 4 run threads 1 to 3.]
Figure 4.8: RapSim Execution Flow
There is no way to get a thread's ID unless either the creator or the thread itself
stores the identifier somewhere.  The set_pthread_self function is called when
processors are initialized or when a context switch occurs among threads, to save and
restore a thread's ID.  When processors are initialized, set_pthread_self puts the
current thread in the pthread array structure.  Likewise, the context_switch function
calls the set_pthread_self routine to place a newly starting thread into the array
structure.  To perform these functions, set_pthread_self calls
machdep_cpuid to acquire the processor ID.  This processor ID is used later to
identify the thread that is running on the current processor.  All the information saved in the
array structure is used by a thread to get its own ID through the pthread_self
function.  After matching up the processor IDs in the pthread array, pthread_self
returns the thread pointer using this processor ID.  The returned thread pointer is used as
a thread ID in the Pthreads functions.  Figure 4.9 shows the operations to get the thread
ID.
[Figure 4.9 placeholder: a pthread_self_t element with the fields self, cpuid, and state;
the initial pthread array; processor 3 calling set_pthread_self with cpuid = 3; and
processor 1 calling pthread_self with cpuid = 1, which returns the 'self' of 'pthread[2]'.]
Figure 4.9: Operations to Get the Thread ID
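The lookup described above can be sketched as follows.  This is an illustrative model
rather than the MMOS implementation: struct thread_desc stands in for the real pthread
descriptor, cpuid_stub and current_cpu replace machdep_cpuid, and the "_sketch" suffixes
mark the functions as hypothetical.  The value chosen for PS_RUNNING is also an assumption.
As in Figure 4.9, entries are matched by their cpuid field rather than by array position.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS   4
#define PS_RUNNING 1                 /* state name from the text; value assumed */

struct thread_desc { int id; };      /* stand-in for the pthread descriptor */

typedef struct {
    struct thread_desc *self;        /* thread currently bound to this entry */
    int cpuid;                       /* processor the thread is running on */
    int state;
} pthread_self_t;

static pthread_self_t pthread_array[MAX_CPUS];
static int current_cpu;              /* stand-in for what machdep_cpuid reads */

static int cpuid_stub(void) { return current_cpu; }

/* Find the in-use entry whose cpuid matches, or NULL if there is none. */
static pthread_self_t *find_entry(int cpu)
{
    for (int i = 0; i < MAX_CPUS; i++)
        if (pthread_array[i].self != NULL && pthread_array[i].cpuid == cpu)
            return &pthread_array[i];
    return NULL;
}

/* Called at processor initialization and on every context switch. */
void set_pthread_self_sketch(struct thread_desc *thread)
{
    int cpu = cpuid_stub();
    pthread_self_t *entry = find_entry(cpu);
    assert(thread != NULL);
    for (int i = 0; entry == NULL && i < MAX_CPUS; i++)
        if (pthread_array[i].self == NULL)
            entry = &pthread_array[i];       /* first free entry */
    assert(entry != NULL);
    entry->self = thread;
    entry->cpuid = cpu;
    entry->state = PS_RUNNING;
}

/* Returns the descriptor of the thread running on the calling processor. */
struct thread_desc *pthread_self_sketch(void)
{
    pthread_self_t *entry = find_entry(cpuid_stub());
    return entry != NULL ? entry->self : NULL;
}
```

Because the array is keyed by processor ID, the context-switch path must call the setter
before the new thread can ask for its own identity.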
The machdep_cpuid function is used to return a processor ID to
set_pthread_self or pthread_self.  In the original Pthreads, machdep_cpuid used the
getpid UNIX system service function, which returns the process ID of the calling
process.  However, RapSim is not a real multiprocessor and instead
contains multiple processor contexts.  Therefore, the getpid system call is no longer
adequate for RapSim, and an alternative way to implement this function had to be
considered.  One possible method is to get the processor ID from a shared location that
is set and checked by RapSim and MMOS.  A special SPARC register was
used to achieve this goal.
SPARC provides for up to 31 Ancillary State Registers (ASRs) numbered from 1
to 31.  ASRs numbered 1-15 are reserved for future use by the architecture and should
not be referenced by software.  ASRs numbered 16-31 are available for
implementation-dependent uses including timers, counters, diagnostic registers, self-test
registers, and trap-control registers [11].
When a particular processor is initialized, all required register files should be
initialized as well.  The register files include the integer register file, the floating-point
register file, and the ASRs.  During the initialization, one of the ASRs numbered 1-15 is set
to the processor ID of the processor that is being initialized.  We picked ASR No. 10 because
it was not likely to be used often by hardware during execution.  In the actual execution,
ASRs No. 1, 2, 4, and 5 were used by hardware.  When a thread encounters set_pthread_self or
pthread_self while it runs an application program, it calls machdep_cpuid to
acquire the current processor ID.  At this time, machdep_cpuid reads the
processor ID from ASR No. 10, which was used to store the processor ID during
initialization.  The processor ID is returned to set_pthread_self or pthread_self.
Figure 4.10 shows the process of getting machdep_cpuid.
[Figure 4.10 placeholder: RapSim writes the processor ID into the ASR register, and MMOS
reads the processor ID from it through 'machdep_cpuid'.]
Figure 4.10: Getting machdep_cpuid
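The ASR-based scheme can be modeled in portable C as below.  This is a sketch under stated
assumptions, not RapSim or MMOS code: each processor context is a plain struct, the
simulator side is represented by sim_init_cpu and sim_select_cpu (hypothetical names), and
reading the register is modeled as an array access rather than a SPARC instruction.  Only
the use of ASR No. 10 to hold the processor ID comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CPUS  4
#define NUM_ASRS  32
#define CPUID_ASR 10                 /* ASR chosen in the text for the processor ID */

/* One simulated processor context; only the ASR file is modeled here. */
typedef struct {
    uint32_t asr[NUM_ASRS];
} cpu_context_t;

static cpu_context_t contexts[NUM_CPUS];
static cpu_context_t *running;       /* context currently given execution cycles */

/* Simulator side: store the processor ID while initializing a context. */
void sim_init_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    contexts[cpu].asr[CPUID_ASR] = (uint32_t)cpu;
}

/* Simulator side: switch execution to another processor context. */
void sim_select_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    running = &contexts[cpu];
}

/* MMOS side: the processor ID is whatever the running context holds. */
int machdep_cpuid_sketch(void)
{
    assert(running != NULL);
    return (int)running->asr[CPUID_ASR];
}
```

The point of the design is that the same code, executed under different processor
contexts, obtains different IDs without any system call.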
4.2.3  Copy Registers 
The Raptor processor structure consists of various types of registers.  These registers
need to be initialized when a new processor begins its execution.  The following are some of
the registers that require initialization [11,13]:
•  Current Window Pointer (CWP) - A counter that identifies the current window
into the r registers.  The hardware increments the CWP on traps and SAVE
instructions, and decrements it on RESTORE and RETT instructions.
•  Savable Windows (CANSAVE) - The CANSAVE register contains the number
of register windows following the CWP that are not in use and are therefore
available for allocation by a save instruction without generating a Window
Spill trap.
•  Restorable Windows (CANRESTORE) - Exactly the inverse of the CANSAVE
register; this register contains the number of register windows preceding the
CWP that are in use by software and can be restored to via the restore
instruction without causing a Window Fill trap.
•  Other Windows (OTHERWIN) - The OTHERWIN register contains the
number of register windows in the ring outside of those accounted for by CWP,
CANSAVE, CANRESTORE and one overlap window (akin to a trap window
under v8).  The windows covered by OTHERWIN are used to spill or fill when
CWP moves beyond CANSAVE or CANRESTORE respectively.
•  Clean Windows (CLEANWIN) - This register contains the number of windows
that are "clean" from the perspective of the current program, either because
they are zeroed or because they contain valid data/addresses for that program's
address space.  When a clean window is requested via a save instruction and
none are available, a Clean Window trap occurs to cause the next window to be
scrubbed.
•  Window State (WSTATE) - The WSTATE register contains two fields,
OTHER and NORMAL.  In each of these fields, the bits are used to select one
of eight different trap vectors for spill/fill exceptions.  If OTHERWIN = 0 at
the time of a trap, then the bits in the WSTATE.NORMAL field are used to
determine the trap vector, otherwise the WSTATE.OTHER bits are used.  This
can be used in conjunction with the OTHERWIN register to segregate one or
more contiguous windows for an alternate address space to the current one, if
the supervisor software so decides.
•  Trap Base Address Register (TBA) - A trap is a vectored transfer of control to
the supervisor through a special trap table that contains the first 4 instructions
of each trap handler.  The trap base address of the table is established by the
supervisor software, by writing the TBA.
•  Processor State Register (PSR) - The 32-bit PSR contains various fields that
control the processor and hold status information.
In addition to these registers, the Integer Registers (IR), Floating-Point Registers
(FPR), Program Counter (PC), Ancillary State Registers (ASR), and other related state
registers should be initialized as well.
When the main processor arrives at the processor initialization point, the remaining
processors can start their execution.  At this time, if the register files of the waiting
processors were initialized in the same way as the main processor's register files,
these processors would start from the first line of the application program.
However, these processors are supposed to run from the location predefined by Pthreads.
Moreover, a newly initialized processor cannot start its execution without
information about the previous execution of the main processor because, unlike the main
processor, it starts its execution in the middle of the program.  Therefore, all registers
of the main processor should be copied into the new processor's register files to run the
program correctly.  However, PC and CWP are not included in this copying process because each
processor has to hold its own PC and CWP for execution.
At any given time, only one window is visible, as determined by the CWP.  The CWP
can be incremented or decremented by the SAVE and RESTORE instructions,
respectively.  These instructions are generally executed on procedure calls and returns.  If
the CWP of the main processor were copied to the newly initialized processor's CWP, the new
processor would try to execute the current procedure of the main processor.  It would
therefore attempt to act as if it were the main processor and try to bind other processors
again.  Consequently, the newly initialized processor would not execute properly.
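The copy rule above can be sketched as a function over a simplified register state.  This
is an illustration, not the RapSim source: the struct layout, the register-file sizes, and
the name copy_registers are assumptions.  Preserving ASR No. 10 is also an assumption,
added here only because Section 4.2.2 stores the per-processor ID in that register; the
text itself names PC and CWP as the excluded registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_INT_REGS 136          /* illustrative size of the windowed file */
#define SKETCH_FP_REGS  32
#define SKETCH_ASRS     32
#define CPUID_ASR       10

/* Simplified per-processor state; field set and sizes are hypothetical. */
typedef struct {
    uint32_t pc;
    uint32_t cwp;
    uint32_t cansave, canrestore, otherwin, cleanwin;
    uint32_t wstate, tba, psr;
    uint32_t int_regs[SKETCH_INT_REGS];
    uint32_t fp_regs[SKETCH_FP_REGS];
    uint32_t asr[SKETCH_ASRS];
} cpu_state_t;

/* Copy the main processor's state into a new processor, keeping the new
 * processor's own PC and CWP (and, by assumption, its processor-ID ASR). */
void copy_registers(cpu_state_t *dst, const cpu_state_t *src)
{
    assert(dst != NULL && src != NULL && dst != src);
    uint32_t own_pc  = dst->pc;
    uint32_t own_cwp = dst->cwp;
    uint32_t own_id  = dst->asr[CPUID_ASR];

    memcpy(dst, src, sizeof *dst);   /* copy everything ... */

    dst->pc  = own_pc;               /* ... then restore the private values */
    dst->cwp = own_cwp;
    dst->asr[CPUID_ASR] = own_id;
}
```

Copying the whole structure and then restoring the few private fields keeps the exclusion
list in one place, which makes it easy to audit against the description in the text.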
4.2.4  Software Trap Support for longjmp
Saving a thread's state is essential for the thread scheduler to execute a context
switch.  In order to perform a thread context switch, we need to save the old thread's state
and restore the state of the new thread.  The stack for a thread can be cached in register
windows while the thread is running, and is then flushed to memory when the thread is
suspended.  Various events can cause the register windows to be "flushed" to memory,
including most system calls.  A programmer can force this update by using a "Software
Trap".
RapSim was modeled after the core SPARC architecture.  Thus, it performs functions very
similar to those of the original SPARC.  Even though RapSim had a "saving thread
registers" function, it could not support context switches correctly among multiple threads.
This function had to be modified multiple times to remove possible problems
caused by executing the Trap Always (ta) instruction.  In particular, applying the ta
instruction with a "Software Trap" was not well supported by RapSim.  The "Software Trap"
is closely related to the longjmp call.  longjmp uses
"ST_FLUSH_WINDOWS", which is one of the software traps, to save the current register
windows on the current stack and to ensure that the registers of the context it is jumping
to are on the stack and not in the register windows.
A context switch must save not only the current register values, but also any
register sets that are currently in register windows.  All of the register sets will contain a
stack pointer, so they can be flushed out to the kernel save areas on traps and interrupts.
Thus, we use a trap that does nothing but flush the register windows out to memory:
    #include <machine/trap.h>

    /*  Software Traps  */

This trap call writes each of the register sets out to memory in the kernel save area.
The ST_FLUSH_WINDOWS trap can be used anytime until the old thread stack pointer
is saved and the new thread stack pointer is loaded.
[Figure 4.11 placeholder: the longjmp sequence (ta ST_FLUSH_WINDOWS; ld [%o0], %o7;
ld [%o0+4], %fp; sub %fp, -64, %sp), the register window with %sp and %fp, and the stack
spaces of processors 0, 1, 2, and 3.]
Figure 4.11: Operation of Software Trap
4.3 	 Performance Simulation Result 
The simulation results and the performance of the on-chip multiprocessor are
explained in this section.  Instructions Per Cycle (IPC), execution cycles, speed-up, and
thread overhead were measured for the performance evaluation.  Three types of
simulation were performed using the given benchmarks.  The first was the simulation
of the non-threaded versions of the benchmarks on a single GPU to provide a base for
comparing the other two simulation results.  The second and third simulations were
performed using threaded versions of the benchmarks on two GPUs and four GPUs
respectively.
4.3.1  Benchmarks 
Some benchmark programs were selected from scientific applications, including
Matrix Multiplication (MMULT), Gaussian Elimination (GE), LU, Fast Fourier
Transform (FFT), and MP3D.  Each benchmark is briefly described below [14,15]:
•  MMULT: parses the matrix data into blocks and assigns them to threads.  The
data set for the threads is relatively disjoint, whereas the row by column
operation produces considerable overlapping of data among threads.
Moreover, there is no inter-thread communication or synchronization.
•  GE: partitions an n-by-n matrix among threads by using the row-wise block-cyclic
approach.  Initially one thread performs the division step with its pivot value
and then all other threads perform an elimination step.  These two steps are
coordinated with barriers.  GE threads tend to have very separate and distinct
data sets with minimal data sharing besides the pivot value.
•  LU: The LU kernel factors a dense matrix into the product of a lower
triangular and an upper triangular matrix.  The dense n×n matrix A is divided
into an N×N array of B×B blocks (n = NB) to exploit temporal locality on
submatrix elements.  It is similar to the GE algorithm except that it also generates
the lower triangular matrix.
•  FFT: The FFT kernel is a complex 1-D version of the radix-√n six-step FFT
algorithm, which is optimized to minimize interprocessor communication.  The
data set consists of the n complex data points to be transformed, and another n
complex data points referred to as the roots of unity.  Both sets of data are
organized as √n×√n matrices partitioned so that every processor is assigned
a contiguous set of rows, which are allocated in its local memory.  Every
processor transposes a contiguous submatrix of √n/p × √n/p from every other
processor, and transposes one submatrix locally.
•  MP3D: MP3D is a simple simulator for rarefied gas flow over an object in a
wind tunnel.  The algorithm is primarily occupied with a loop consisting of three
phases.  Each thread is given particles and proceeds to move them within a
defined cell space.  The thread continuously detects any possible collisions of
its molecules with other molecules and updates the geometry of molecules each
time step.  MP3D contains data that is very localized and shares much of that
data among threads.  Also, each phase has to be completed by all the threads
before continuing to the next phase, requiring a larger amount of synchronization.
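The row-wise block-cyclic partitioning mentioned for GE can be sketched as a mapping from
row index to owning thread.  This is a generic illustration of the technique, not code
from the benchmark: the function names and the block size parameter are assumptions, and
the block size actually used by the benchmark is not stated in the text.

```c
#include <assert.h>

/* Owner of a matrix row under a row-wise block-cyclic distribution:
 * rows are grouped into blocks of `block_rows` consecutive rows, and the
 * blocks are dealt out to the threads in round-robin order. */
int row_owner(int row, int block_rows, int num_threads)
{
    assert(row >= 0 && block_rows > 0 && num_threads > 0);
    return (row / block_rows) % num_threads;
}

/* Number of rows of an n-row matrix that a given thread owns. */
int rows_owned(int n, int block_rows, int num_threads, int thread)
{
    int count = 0;
    for (int row = 0; row < n; row++)
        if (row_owner(row, block_rows, num_threads) == thread)
            count++;
    return count;
}
```

With 10 rows, blocks of 2 rows, and 4 threads, thread 0 owns rows 0, 1, 8, and 9, while
thread 3 owns rows 6 and 7.  The cyclic dealing keeps the load roughly balanced as
elimination proceeds and the upper rows drop out of the computation.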
4.3.1  Instruction Distribution 
On a multiprocessor system, multithreading allows a process to perform more than
one independent computation at the same time [4].  This makes it possible to reduce the
workload of each processor.  In general, a workload (W) consists of a sequential part (W_seq)
and a parallelized part (W_parallel).  W_parallel can be partitioned into a number of parallel
threads.  W_overhead is the overhead that results from the workload parallelization.  W_seq
is the part of the workload that must be performed sequentially in the application
program.  MMOS divides the parallel part of the workload into multiple threads and
distributes them to the GPUs.
Figure 4.12 shows the effect of multithreading on the multiprocessor RapSim.
When RapSim exploits multithreading, instructions are distributed to the multiple
processors.  As the number of GPUs increased, the instructions were distributed evenly
over all the GPUs.  However, FFT contains about 45% sequential part, which
must be computed sequentially.  Therefore, we can see that by employing multithreading
all GPUs were relatively equally exploited in most cases, except FFT.
[Figure 4.12 placeholder: panels for FFT, MP3D, LU, MMULT, and GAUSS plot the number of
instructions executed on GPU 0, GPU 1, GPU 2, and GPU 3 against the number of processors
(1, 2, 4) for several problem sizes N, plus a summary panel of the instruction
distribution in percent.]
Figure 4.12: Instruction Distribution
4.3.2  Execution Cycle vs IPC 
One of the most important performance characteristics in microprocessor
simulation is the IPC.  IPC can be described as the number of instructions executed in each
processor cycle.
Figure 4.13 shows the scalability of IPC.  In the actual simulation, the IPC
increased linearly as the number of GPUs increased.  However, the IPC of FFT
increased only slightly even though it was simulated using multiple processors.  This is
due to the large sequential part of FFT.  The IPC of RAPTOR ranged from 3.5 to 4.1
when the benchmarks were simulated on the 4-way RAPTOR model.  For a single GPU,
the IPC ranged from 0.9 to 1.1.  For FFT, the IPC of a single GPU was about
0.9 and the IPC of the 4-way model was about 1.5.  The major factor in IPC scalability
was the proportion of the parallel part in a program.  Therefore, if the benchmarks
contain a large parallel portion, we can expect to achieve better IPC
scalability.
The number of execution cycles is another critical factor in evaluating
performance.  Figure 4.13 shows that the number of execution cycles was reduced as the
number of GPUs increased.  However, the number of execution cycles increased for
MP3D.  As shown in Figure 4.12, the number of instructions increased as more
GPUs were employed.  This means that the overhead from parallelization caused the
number of execution cycles to increase for MP3D.
[Figure 4.13 placeholder: panels for FFT, MP3D, LU, MMULT, and GAUSS plot execution cycles
and IPC against the number of processors (1, 2, 4) for several problem sizes N, plus a
summary panel of cycle time vs IPC across the five benchmarks.]
Figure 4.13: Cycle Time vs IPC
4.3.3  Speed-up vs Thread Overhead
A computation-intensive threaded application running on two processors may
achieve nearly twice the performance of a conventional single-threaded version.
However, there is always some overhead due to creating the extra thread(s) and
performing synchronization [4].
The total execution time of a sequential workload W can be expressed as
T = T_seq + T_parallel, where T_seq and T_parallel denote the execution time of W_seq and
W_parallel respectively.  T_seq is the same as that of the sequential programming model.
T_parallel is divided among the threads, and the threads are distributed over the GPUs.  The
execution time of thread overhead, T_overhead, includes thread management, inter-thread
communication, and synchronization.
The performance gain depends on two major factors: the available parallelism
(W_parallel), which is inherent in the benchmarks, and the thread overhead (W_overhead), which
is caused by multithreading.  If the applied benchmarks do not contain enough
parallelism, we cannot obtain better GPU utilization than in the case of
sequential execution.  Even if the benchmarks contain enough parallelism, the
thread overhead may restrict the overall performance.
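The interaction of these two factors can be illustrated with a simple model.  The sketch
below is an idealization that is not stated in the text: it assumes that T_parallel divides
perfectly over p GPUs and that T_overhead is a fixed additive cost, giving
T(p) = T_seq + T_parallel / p + T_overhead.  The function names are hypothetical.

```c
#include <assert.h>

/* Idealized execution time on p GPUs (assumed model, see lead-in). */
double model_time(double t_seq, double t_parallel, double t_overhead, int p)
{
    assert(p > 0);
    return t_seq + t_parallel / p + t_overhead;
}

/* Speed-up relative to the sequential version, which pays no thread overhead. */
double model_speedup(double t_seq, double t_parallel, double t_overhead, int p)
{
    double t_sequential = t_seq + t_parallel;
    return t_sequential / model_time(t_seq, t_parallel, t_overhead, p);
}
```

If the roughly 45% sequential share reported for FFT is treated as a share of execution
time, model_speedup(0.45, 0.55, 0.0, 4) gives about 1.7 even with zero overhead, which is
of the same order as the FFT IPC growth from about 0.9 to about 1.5 reported in Section
4.3.2.  Conversely, a fully parallel workload with a large overhead term, for example
model_speedup(0.0, 1.0, 0.8, 4), falls below 1, which is the qualitative behavior
described for MP3D.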
Figure 4.14 shows the Speed-up as the number of GPUs increased.  When the
number of GPUs is increased, the Speed-up also increases, except in the case of MP3D.
Even though MP3D contains a sufficient amount of parallelism, heavy data sharing
generates very frequent inter-thread communication and synchronization.  This
dominant thread overhead restricts the Speed-up.  The Speed-up for FFT is not as high
as the other results.  Because of the large sequential part of FFT, the
IPC scalability and Speed-up are limited and the GPU utilization is not as good.
[Figure 4.14 placeholder: panels for FFT, MP3D, LU, MMULT, and GAUSS plot Speed-up and
thread overhead (in percent) against the number of processors (1, 2, 4) for several
problem sizes N, plus a summary panel of IPC vs Speed-up vs Thread Overhead across the
five benchmarks.]
Figure 4.14: Speed-up vs Thread Overhead
5.  Conclusions and Future Projections 
An on-chip multiprocessor is a promising candidate for a billion-transistor system [1].
The performance simulation results show that, when the CPUs run multithreaded
benchmarks, we obtain an IPC that ranges from 3.5 to 4.1 using four CPUs. In
comparison, the sequential versions of the benchmarks on a single CPU yield an IPC of
about 0.9 to 1.1. For FFT, which contains a large sequential portion, creating multiple
threads does not result in a better IPC as the size of the application program becomes
larger. If we use programs that cannot take advantage of the increased parallelism, the
increased machine cycle time reduces the benefit of using a multiprocessor. This means
that an application program should contain a sufficient amount of parallelism.
Speed-up increases linearly for most of the benchmarks as the number of CPUs is
increased. Even if the benchmarks are sufficiently parallelizable and can achieve a
high IPC, we cannot obtain a better Speed-up if generating multiple threads results in
a larger thread overhead. The large thread overhead of MP3D reduces the Speed-up
even when the number of CPUs is increased. This means that Speed-up depends on the
thread overhead as well as on the inherent parallelism of the benchmarks.
Even though the current version of MMOS supports RapSim, it does not provide the
complete set of multithreading features for RapSim. In particular, MMOS does not
support a context switch when a cache miss occurs. Since it takes many cycles to fetch
data from memory, a context switch could replace the currently running thread with
another thread when a cache miss occurs. This would allow RapSim to tolerate long
memory latencies and thereby increase the performance of the processor. As processors
become faster, processor performance depends increasingly on how well memory latency
is tolerated. In a cache-coherent multiprocessor model, the role of the cache is even
more important because sharing data among processors results in more cache traffic.
ETRI's future plan for RapSim is to implement a Cache-Coherent, Non-Uniform Memory
Access multiprocessor model. Therefore, additional functions for the cache-coherent
model should be incorporated into MMOS to obtain better performance.
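The latency-tolerance argument above (switching to another thread on a cache miss) can be illustrated with a simple first-order utilization model. This sketch is an assumption for illustration only and is not taken from MMOS or RapSim: each thread is assumed to run for a fixed number of cycles between misses, every miss costs a fixed latency, and every context switch costs a fixed number of cycles. The function and parameter names are hypothetical.

```python
def utilization(n_threads, run_length, miss_latency, switch_cost):
    """Fraction of cycles one CPU spends on useful work.

    n_threads    -- threads available to the CPU
    run_length   -- cycles a thread runs between cache misses (assumed fixed)
    miss_latency -- cycles needed to service one cache miss (assumed fixed)
    switch_cost  -- cycles spent on one context switch (assumed fixed)
    """
    if n_threads < 1 or run_length <= 0:
        raise ValueError("need at least one thread and a positive run length")
    if n_threads == 1:
        # No switch on a miss: the CPU stalls for the full latency.
        return run_length / (run_length + miss_latency)
    # Few threads: the CPU still idles while all threads wait on memory.
    linear = n_threads * run_length / (run_length + switch_cost + miss_latency)
    # Many threads: latency is fully hidden, only the switch cost remains.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the thesis.
    for n in (1, 2, 4, 8):
        print(n, round(utilization(n, 20, 80, 4), 3))
```

In this model, adding threads raises utilization until the memory latency is fully overlapped; beyond that point the context-switch cost sets the ceiling, which is why a cheap switch mechanism matters.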
Bibliography 
1. 	L. Hammond, et al., "A Single-Chip Multiprocessor," IEEE Computer, Vol. 30, No.
9, pp. 79-85, 1997.
2. 	S. Eggers, et al., "Simultaneous Multithreading: A Platform for Next-Generation
Processors," IEEE Micro, Vol. 17, No. 5, pp. 12-19, 1997.
3. 	 D. Lewine, POSIX Programmer's Guide, O'Reilly & Associates, Inc., 1991. 
4. 	D. R. Butenhof, Programming with POSIX Threads, Addison Wesley, 1997.
5. 	 B. Lee, "Architectural Support for Multithreading on Quad-processor Multiprocessor 
Chip," 1997-1998 FY Mid-Annual  Progress  Report prepared for Electronics  and 
Telecommunications Research Institute. 
6. 	 K.  Park,  "On-chip  Multiprocessor  with  Simultaneous  Multithreading,"  In 
Preparation. 
7. 	 D.R.  Ditzel  and  H.R.  McLellan,  "Branch Folding  in  the  CRISP Microprocessor: 
Reducing Branch Delay to Zero," 14th Annual Symposium on Computer Architecture, 
pp.2-9, 1987. 
8. 	Unlocking the Kernel, SunWorld, 1998, available at
http://www.sunworld.com/sunworldonline/swol-08-1998/swol-08-perf.html.
9. 	 P. Brumm, et al., 80486 Programming, Windcrest, 1991. 
10. MMX(TM) Technology Programmer's Reference Manual, Intel, 1996, available at
http://developer.intel.com/drg/mmx/Manuals/prm/PRMCHP4.HTM.
11. SPARC International, Inc., The SPARC Architecture Manual version 9, 1994. 
12. F. Zlotnick, The POSIX Standard: A Programmer's Guide, Benjamin/Cummings
Publishing Company, Inc., 1991.
13. SPARC  traps  under  Sun OS,  Sun  Microsystems  Inc.,  1997,  available  at 
http://sunsite.ics.forth.gr/sunsite/Sun/SparcTraps. 
14. S. C. Woo, et al., "The SPLASH-2 Programs: Characterization and Methodological
Considerations," Proceedings of the 22nd Annual International Symposium on
Computer Architecture, June 1995, pp. 24-36.
15. H. Kwak, "A Performance Study of Multithreading," Ph.D. Thesis submitted to
Oregon State University, 1998.