Paint: PA instruction set interpreter by Stoller, Leigh B. & Swanson, Mark R.
Paint:
PA Instruction Set Interpreter 1
Leigh B. Stoller





Department of Computer Science 
University of Utah 
Salt Lake City, UT 84112, USA
September 11, 1996
Abstract
This document describes Paint, an instruction set simulator based on Mint[3]. Paint interprets
the PA-RISC instruction set, and has been extended to support the Avalanche Scalable Computing
Project[2]. These extensions include a new process model that allows multiple programs to be run
on each processor and the ability to model both kernel and user code on each processor. In addition,
a new address space model more accurately detects when a program is accessing an illegal virtual
address, allows a program’s virtual address space to grow dynamically, and does lazy allocation of
physical pages as programs need them.
Note that this document is intended to be an addendum to the original Mint technical report,
which the reader should consult for an overview of the Mint simulation environment and
terminology.
1 This work was supported by a grant from Hewlett-Packard, and by the Space and Naval Warfare Systems
Command (SPAWAR) and Advanced Research Projects Agency (ARPA), Communication and Memory Architectures
for Scalable Parallel Computing, ARPA order #B990 under SPAWAR contract #N00039-95-C-0018.
Contents
1 Introduction
  1.1 Program Driven Simulation
  1.2 Virtual Memory Model
  1.3 BackEnd Interface
2 Process Model
  2.1 User Level View
  2.2 Instruction Execution
  2.3 Events
  2.4 Tasks
3 Paint Address Space
  3.1 Address Space Organization
  3.2 Shared Memory Support
  3.3 Fork and Exec
  3.4 Cache Support
4 Stall On Use; Asynchronous Writes, I/O Space
  4.1 Valid Registers
  4.2 Asynchronous Events
    4.2.1 sou_load()
    4.2.2 sou_store()
    4.2.3 Simple Read Example
  4.3 I/O Space Access
5 Machine Dependent Interfaces
  5.1 System Calls
  5.2 Interrupts
  5.3 Traps
6 Specifying Simulation Parameters
7 Command Line Arguments
8 Simulator Support for the Kernel
  8.1 newvproc()
  8.2 getvproc()
  8.3 getmaxnodes()
9 Miscellaneous Notes
  9.1 Building Paint
  9.2 An Example Kernel
  9.3 Who To Contact
  9.4 Credits
1 Introduction
This note describes the Paint (PA Interpreter) simulation environment. Paint is based on the
Mint[3] simulation system developed at the University of Rochester, and has been modified to
interpret the PA-RISC[1] instruction set and to support the Avalanche Scalable Computing Project[2].
These changes are documented here. The reader is encouraged to read the original Mint report
before proceeding, but as a review the next few sections present the essential concepts. More detailed
descriptions follow later.
1.1 Program Driven Simulation
Paint is a program-driven simulator, partitioned into two main parts: a memory reference generator 
(the “frontend”) and a target system simulator (the “backend”). The frontend models the execution 
of a program by simulating the instruction stream. When an instruction causes or requires special 
operation, such as a memory reference or special system instruction, the frontend generates an 
event for the backend to operate on. The backend models the memory hierarchy and interconnect 
of the target system, including, but not limited to, the first level cache, the TLB, the system bus, 
main memory, and the network interface. When the operations for carrying out the event have 
completed, the backend signals the frontend to continue execution of the instruction stream for 
that processor.
Figure 1: Program Driven Organization
Program execution in Paint is interpreted; the instruction stream consists of a sequence of data 
structures representing the actual instructions. The state of the processor is represented in a global 
data structure. Processor state includes the values of registers, virtual to physical page translation 
tables, the current program counter, etc. As instructions are interpreted, the value of this structure 
changes. When the simulator switches to a new processor, a different global structure is installed 
as the current processor, and execution continues as before.
When a program is loaded, the text portion of the file is scanned and converted to a linked list 
of structures called instruction codes (or icodes). The icodes are linked together, for both sequential 
and non-sequential execution (branches and jumps). Each icode stores information about how to 
interpret the instruction, as well as a pointer to a function to handle the actual interpretation. In 
general, each instruction has a specific function, although some have more than one when certain 
opportunities for optimization are detected. Execution then consists of calling the function for 
each instruction, which may modify the processor state, and which returns a pointer to the next 
instruction icode to simulate, which might be the next sequential instruction or the target of a
branch or jump.
Figure 2: Execution Timeline (processors P1, P2, P3)
1.2 Virtual Memory Model
Memory reference instructions are handled specially by the Paint frontend. Once the user virtual
address is computed from information in the icode structure, a virtual to physical translation must
be performed. Two translations are actually produced; the first is a “processor” physical address
that is used by the backend, and the second is a “paint” physical address that corresponds to the
actual location within Paint’s address space. The processor physical address is used by backend
modules that require realistic physical addresses, like a cache or memory bus module. The Paint
physical address is used as the location to actually read and write data to memory. When the TLB
module is in use, the simulated kernel sets the processor physical address using appropriate target
machine instructions. If the TLB is switched off, the processor physical address is set equal to the
Paint physical address.
1.3 BackEnd Interface
As mentioned above, the simulator frontend is responsible for executing instructions until something
interesting occurs, such as a memory reference. At this point instruction execution is suspended
and an event is generated for the backend. The event is a data structure that packages up details
about the event so that they may be communicated to the backend. The backend then operates
on the event, possibly scheduling tasks to handle event activities. Tasks are scheduling entities
that contain a time to run and a dispatch function to invoke when the specified time arrives. In
this way, multiple concurrent activities can be in progress, including: instruction execution by
processors that are not blocked waiting for an event to complete, memory hierarchy activities in
support of processor load and store instructions, and network transmission and reception. When
the event is complete, the backend signals the frontend that instruction execution for the specified
processor may proceed (in fact, instruction execution is a task associated with each processor). See
figure 2.
2 Process Model
By far the most wide-reaching change to Mint was to the process model. Mint was originally
a Single Program, Multiple Data (SPMD) system. A program would start, possibly do some
initialization, and fork one or more children. Each child was considered not only a new process, but
a new processor as well. In fact, process and processor were essentially the same. The disadvantage
of this model is that multiple programs cannot be run, forcing a particular programming model
that is not always appropriate for distributed memory machines. Further, by not being able to
simulate multiple programs per node, the time and memory effects of “kernel mode” cannot be
measured since all operating system functionality was implemented in the simulator itself. This
has the effect of making many operations cost free, thus skewing the simulation results.
The process extensions made to Mint allow it to run a functional kernel on each node (see
section 9.2), and an arbitrary set of user programs on each node. Operationally, each simulated
processor is represented by a single Paint thread. The kernel is the first program to run within that
thread, followed by one or more user level programs. Like a real machine, the kernel context switches
between user programs with appropriate target machine instructions that change the register state
of the processor. Paint maintains an association between the different programs and the simulated
processor those programs are running on, which allows Paint to switch the instruction stream and
virtual address context when requested to do so by the kernel.
Before presenting a detailed description of the frontend operation, a user level view of the 
process model is given.
2.1 User Level View
When Paint is started, the program it is given to simulate is the kernel. The kernel then duplicates
itself on each virtual processor as the first program using the newvproc() system call. Once the
kernel is running on the new processor, it forks a child, and execs the init program. The init
program reads a setup file that specifies which programs to run on each processor, and forks/execs
the programs for its processor. The init program then exits. At this point the kernel goes into an
idle loop, waiting for system calls and interrupts. Of course, the kernel does not actually idle since
executing instructions that do nothing is too costly in a simulated environment. Instead, the kernel
puts itself to sleep. The simulator wakes the kernel up when it needs to do something.
Since the goal is to run “user” program binaries unchanged, and without special compilation,
the system call interface is identical to the one used in the BSD and HPUX kernels. When the
simulator detects a system call in a user program, it vectors the instruction stream for that thread
to a known address in the kernel. Eventually the kernel must handle the system call. For most
calls, this means letting the simulator take care of it (see Section 8 in the Mint User Manual) by
“calling” the intended system call function, just as the user program did. The difference is that
when the kernel calls a system call (say, open(...)), the simulator intercepts the call and handles the
operation, returning a result (a file descriptor in the case of open()). The kernel then returns the
value to the user program just as a production kernel does. The goal is to have the kernel catch all
system calls so it can decide which ones are handled in the simulated kernel, and which are passed
on to the simulator itself.
Multiple programs can be run on each node. Additional support from the simulator allows 
the kernel to context switch between multiple kernel threads. When the simulator executes one of 
several PA instructions (be, ble, rfi), the simulated instruction stream is switched to a different set of 
instructions, as defined by the PA architecture. In other words, a single Paint thread multiplexes 
several simulated kernel threads using real context switch code to change register and program
counter values.
Process scheduling is done in the kernel, using the BSD 4.4 scheduling subsystem. The simulator
generates simulated clock interrupts that are delivered asynchronously to the kernel so that
it may update the scheduling data structures, recompute process priorities, and possibly arrange
for the current process to be context switched out. Kernel timers are also supported. The default
period of the clock is 100,000 cycles, and is a configurable option to the simulator using the
VPROC_clockperiod parameter value (see section 6).
Asynchronous interrupts and traps are handled in a manner similar to system calls. When
a simulation module generates an interrupt or a trap for a processor, the instruction stream for
the currently running process on that processor is vectored to a known location in the kernel. A
standard state save is done (written in assembly language), then a call is made to a C dispatch
function to handle the interrupt.
The following sections describe the Paint frontend in more detail. Later sections expand further 
on key areas.
2.2 Instruction Execution
Instruction execution is the most basic operation in Paint. At its simplest, the instruction loop
takes the current instruction, represented by a pointer to an icode, calls the dispatch function
contained in the icode, and receives back a pointer to the next instruction to execute. This repeats
until an event is generated, or until a maximum number of instructions have been executed in a
row. At this point a rescheduling operation is performed, and a new task is selected to run. This
new task might invoke the instruction execution loop for a new processor, or it might be a task
that is working on an event for some processor, or it might be an anonymous task that is scheduled
to perform some operation in a simulation module. This operation repeats until there are no more
tasks scheduled to run, at which time the simulation terminates.
When a task does invoke the instruction loop, it begins execution with the current instruction 
pointer. Figure 3 shows the icode data structure. Many of the fields are specific to the actual 
instruction. For example, the immed field holds the signed immediate value for any instruction 
whose format includes an immediate. Other fields have a common usage during execution, and 
should be described:
func        The function to invoke to handle the actual simulation of the instruction.
next        A pointer to the next sequential instruction in the code stream.
target      A pointer to the branch target instruction when the instruction is a conditional
            or unconditional branch, and the target can be computed statically.
cycles      The number of CPU cycles the instruction consumes, not including memory
            hierarchy delay. The value is added to a running count as instructions are
            simulated.
validregs   The set of scalar registers used by the instruction, represented as a bitmask, 1
            bit for each of 32 registers. This field is used to implement stall on use loads
            (see section 4).
validfregs  The set of floating point registers used by the instruction, represented as a
            bitmask, 1 bit for each of 64 singles, or 2 bits for each of 32 doubles (see
            section 4).
Figure 3: Instruction Code Data Structure
Each instruction function returns a pointer to the next icode to execute. The next icode is either
the next sequential instruction, or the target of a branch, or a dynamically computed jump target.
The first two pointers are computed when the program is loaded, and stored in the icode structure.
The pointer to the next icode for a dynamically computed jump target is returned by the T2I
function, which takes the target address of the jump as its argument. When the instruction loop
is suspended, the last icode pointer is stored in the processor data structure (this is effectively the
program counter), and the cycle count for the processor is incremented. The instruction execution
task for the processor may be rescheduled to run at the new processor cycle count, or it might not
if the loop was suspended due to an event. Time moves forward since each processor runs for a
short time (possibly ahead of other processors), with all processors eventually getting a chance to
move forward (see figure 2).
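The instruction loop described in this section can be sketched roughly as follows. This is a minimal illustration and not Paint's actual code: the sk_ names, the reduced icode and processor structures, the is_event flag, and the two example instruction handlers are all assumptions made for the sketch. Dynamically computed jump targets (the T2I lookup) are not modeled.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Paint's icode and processor state. */
struct sk_cpu;

typedef struct sk_icode {
    struct sk_icode *(*func)(struct sk_icode *ip, struct sk_cpu *cpu);
    struct sk_icode *next;    /* next sequential instruction */
    struct sk_icode *target;  /* statically computed branch target */
    int cycles;               /* CPU cycles, excluding memory hierarchy delay */
    int is_event;             /* assumption: nonzero marks an event icode */
    int immed;                /* signed immediate operand */
} sk_icode_t;

typedef struct sk_cpu {
    long regs[32];
    sk_icode_t *pc;             /* last icode pointer, effectively the PC */
    unsigned long cycle_count;  /* running cycle count for this processor */
} sk_cpu_t;

/* Example handler: add the immediate to register 1, fall through. */
sk_icode_t *sk_addi(sk_icode_t *ip, sk_cpu_t *cpu)
{
    cpu->regs[1] += ip->immed;
    return ip->next;
}

/* Example handler: branch to the static target while register 1 is nonzero. */
sk_icode_t *sk_branch_if_nonzero(sk_icode_t *ip, sk_cpu_t *cpu)
{
    return cpu->regs[1] != 0 ? ip->target : ip->next;
}

/*
 * Run until an event icode is reached, the stream ends, or max_insns
 * instructions have executed in a row. Returns the number executed.
 */
int sk_run(sk_cpu_t *cpu, int max_insns)
{
    sk_icode_t *ip = cpu->pc;
    unsigned long cycles = 0;
    int n = 0;

    while (ip != NULL && n < max_insns && !ip->is_event) {
        cycles += (unsigned long)ip->cycles;  /* read before ip changes */
        ip = ip->func(ip, cpu);               /* handler returns next icode */
        n++;
    }
    cpu->pc = ip;                  /* saved when the loop is suspended */
    cpu->cycle_count += cycles;
    return n;
}
```

A caller would resume the same processor later by calling sk_run() again; the saved pc makes the loop restartable at the point where it was suspended.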
2.3 Events
While the Paint frontend is primarily concerned with instruction execution, it is the Paint backend
that is responsible for more detailed simulation of selected architectural features. The most obvious
example is the memory hierarchy. In the absence of event generation, all loads and stores would
take a fixed amount of time, which is unrealistic. The backend writer can instead model a detailed
memory hierarchy. Meanwhile, other processors can continue ahead until some synchronizing event
occurs. There are many types of events that can be generated for the backend. This document
will concern itself with just memory events, so the reader should consult the Mint[3] document for
a discussion of other events.
    typedef struct task {
        struct task *next;
        struct task *prev;
        int priority;
        int pid;
        mint_time_t time;
        PFTASK ufunc;
        struct event *pevent;
        int ival1;
        void *uptr1;
    } task_t, *task_ptr;
Figure 4: Task Data Structure
Events in the Paint frontend look much like an instruction. They are represented by icodes,
but with a function pointer to a routine that initiates the event. These special icodes are placed
into the instruction stream when the program is loaded. The loader scans the instructions, and
in the case of memory reference instructions, creates a duplicate icode, flags it as an event, and
replaces the function with the appropriate event routine (either event_read() or event_write()).
The original icode is left as the next sequential icode. In other words, each memory reference icode
is preceded by a new event icode that causes the frontend to suspend execution and invoke the
appropriate backend function. When the backend signals that execution can continue, the original
icode is executed to effect the changes in processor state required by the particular instruction.
For example, PA-RISC loads and stores do base register modification, which must occur after the
memory reference completes.
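The loader pass described above might look roughly like the sketch below. It is an illustration under stated assumptions: the ev_ structure, its flag fields, and the sk_ function names are hypothetical, the event routines are empty placeholders, and fixing up branch targets so that they point at the inserted event icode is omitted.

```c
#include <stdlib.h>

/* Hypothetical, reduced icode used only for this sketch. */
typedef struct ev_icode {
    void (*func)(struct ev_icode *);  /* instruction or event routine */
    struct ev_icode *next;            /* next sequential icode */
    int is_memref;                    /* 1 for loads and stores */
    int is_store;                     /* 1 for stores, 0 for loads */
    int is_event;                     /* set on inserted event icodes */
} ev_icode_t;

/* Placeholder event routines standing in for event_read()/event_write(). */
void sk_event_read(ev_icode_t *ip)  { (void)ip; }
void sk_event_write(ev_icode_t *ip) { (void)ip; }

/*
 * Walk the icode list and place a duplicate, flagged as an event, in front
 * of every memory reference icode. The original icode stays in place as the
 * event icode's sequential successor. Returns the (unchanged) head, or NULL
 * if an allocation fails.
 */
ev_icode_t *sk_insert_events(ev_icode_t *head)
{
    ev_icode_t **link = &head;

    while (*link != NULL) {
        ev_icode_t *ip = *link;
        if (ip->is_memref && !ip->is_event) {
            ev_icode_t *ev = malloc(sizeof *ev);
            if (ev == NULL)
                return NULL;
            *ev = *ip;                 /* duplicate the original icode */
            ev->is_event = 1;          /* flag it as an event */
            ev->func = ip->is_store ? sk_event_write : sk_event_read;
            ev->next = ip;             /* original follows the event */
            *link = ev;
        }
        link = &ip->next;              /* continue after the original */
    }
    return head;
}
```

Keeping the original icode untouched is what lets side effects such as base register modification run only after the backend has finished with the memory reference.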
2.4 Tasks
When the backend function for an event is called, it is given a single argument, a pointer to the
task controlling instruction execution for that processor (see figure 4). When the backend function
returns, it indicates via a status value whether instruction execution should continue or suspend
until some future time, and whether the task should be put back on the free list. If execution is
suspended, then it is up to the backend to save and eventually reschedule the task so that instruction
execution may proceed. The fields of the task_t data structure are:
next, prev    Queuing elements. These fields can be used only when the task is not currently
              scheduled to run since they are also used by the scheduling system.
time          The absolute time at which the task should be run.
priority      The task priority. If multiple tasks need to run in the same timestep, and they
              need to be ordered, a priority can be assigned to force one task to run before
              or after another.
pid           The processor ID the task is executing on behalf of.
ufunc         The function to invoke when the task runs. This function must return one of:
              T_ADVANCE  The processor associated with this task may continue executing
                         instructions.
              T_FREE     This task is put on the free list and the next task with the
                         minimum time is removed from the task queue and executed.
              T_YIELD    This is the same as T_FREE except that the task is not put on
                         the free list. Only a reschedule is performed.
pevent        A pointer to the event data structure that was constructed by the frontend.
ival1, uptr1  Storage locations for arbitrary values to pass to the function. There are many
              more such variables, so the reader should consult the header file.
When the backend function is invoked, the pevent field of the task_t data structure points to
the event structure created by the frontend. The event structure is quite large and can accommodate
many types of events, so the reader should consult both the Mint document and the “event.h”
header file. The various task scheduling functions are described in the Mint document.
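To make the task mechanism concrete, here is a small sketch of a time-ordered task queue and its run loop. It is not the Mint or Paint scheduler: the sk_ names, the numeric values of the return codes, the fixed limit of 8 processors, and the rule that a higher priority value runs first within a timestep are all assumptions chosen for illustration.

```c
#include <stddef.h>

/* Return codes mirroring T_ADVANCE, T_FREE, T_YIELD (values assumed). */
enum { SK_T_ADVANCE, SK_T_FREE, SK_T_YIELD };

#define SK_MAX_CPUS 8   /* arbitrary limit for the sketch */

typedef struct sk_task {
    struct sk_task *next;
    int priority;
    int pid;                          /* processor the task works for */
    unsigned long time;               /* absolute time to run at */
    int (*ufunc)(struct sk_task *);   /* dispatch function */
} sk_task_t;

typedef struct {
    sk_task_t *queue;                 /* sorted by time, then priority */
    sk_task_t *free_list;
    unsigned long now;                /* current simulated time */
    int advanced[SK_MAX_CPUS];        /* T_ADVANCE results per processor */
} sk_sched_t;

/* Example dispatch functions. */
int sk_finish(sk_task_t *t)     { (void)t; return SK_T_FREE; }
int sk_resume_cpu(sk_task_t *t) { (void)t; return SK_T_ADVANCE; }

/* Insert a task so the queue stays ordered by time, then by priority. */
void sk_schedule(sk_sched_t *s, sk_task_t *t)
{
    sk_task_t **link = &s->queue;

    while (*link != NULL &&
           ((*link)->time < t->time ||
            ((*link)->time == t->time && (*link)->priority >= t->priority)))
        link = &(*link)->next;
    t->next = *link;
    *link = t;
}

/* Run tasks in time order until none remain. Returns how many ran. */
int sk_run_all(sk_sched_t *s)
{
    int ran = 0;

    while (s->queue != NULL) {
        sk_task_t *t = s->queue;      /* task with the minimum time */
        s->queue = t->next;
        t->next = NULL;
        s->now = t->time;
        switch (t->ufunc(t)) {
        case SK_T_ADVANCE:            /* processor may keep executing */
            if (t->pid >= 0 && t->pid < SK_MAX_CPUS)
                s->advanced[t->pid]++;
            break;
        case SK_T_FREE:               /* recycle the task */
            t->next = s->free_list;
            s->free_list = t;
            break;
        default:                      /* SK_T_YIELD: reschedule only */
            break;
        }
        ran++;
    }
    return ran;
}
```

The loop ends when the queue is empty, which corresponds to the point at which the simulation terminates because no tasks remain scheduled.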
3 Paint Address Space
This section describes the changes to address space translation. Paint dynamically translates
addresses during simulation, using a simple address translation formula that converts a program
“virtual” address into an address in Paint’s “physical” space. There were several characteristics of
the original memory model that needed improvement:
• Address space protection: Program errors can easily generate illegal virtual addresses, which
when translated to physical addresses, reference data in another program, or in the simulator
itself. The translation mechanism should check the validity of each virtual address presented
for translation.
• Dynamic allocation of memory: The program’s data and stack segments should be allowed
to extend past their original size as needed.
• Lazy allocation of memory: The physical pages for the bss, heap and stack should not require
allocation until they are referenced by the running program. This would reduce the number
of unused, and thus wasted, pages. With a small number of nodes, this is not an issue, but 32
and 64 node simulations of even moderate sized programs become difficult, even on machines
with hundreds of megabytes of swap space and real memory.
Our approach was to implement page tables in the simulator. Page tables allow us to accurately
detect when a program is accessing an illegal virtual address, to implement dynamically sized
segments, and to allocate physical pages lazily as the program needs them. An additional benefit
is that TLB information can be stored in the page tables. Finally, a recent optimization allows
cache line status to be stored in additional data structures attached to each page table entry. By
utilizing this information in the frontend, calls to the backend for each and every memory event
can be avoided, resulting in a twofold increase in simulator performance.
TLB support was then implemented using fields in the page table structure. Before a memory
event is allowed to proceed, the corresponding page table entry is accessed. If the page is marked
as currently being in the TLB, and the read/write access permission bits match the type of access,
the memory event is allowed to proceed normally. If the page is not in the TLB, or if the access
type is wrong (i.e., writing a read-only page), a TLB miss is generated by calling tlb_domiss() in
“tlb.c”. In addition to some bookkeeping, a processor trap is generated with vproc_trap()
(see section 5.3). Subsequent TLB insertion instructions (for the PA, idtlbp and idtlba) executed
by the kernel cause the TLB information for that page to be updated and the instruction retried.
The full cost of the TLB is modeled.
The current TLB model is very simple. A 96 entry, fully associative TLB is modeled by
maintaining a list of page table structures that are currently in the TLB. When a capacity miss
requires an entry to be replaced, the oldest entry is removed from the list, and the new one entered.
The size of the TLB can be altered with the TLB_numentries parameter field (see section 6). While
the PA-RISC supports a rather rich set of TLB options (protection identifiers, multiple privilege
levels, etc.), the simulated TLB supports only read/write access permissions. By default, TLB
modeling is turned on in the simulator. To turn the TLB off, use the TLB_on parameter field,
setting it to 0.
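A minimal sketch of such a TLB model is shown below: a fixed-size list of resident page table entries with oldest-first replacement and a read/write permission check. The sk_ names, the permission bit encodings, and the use of a circular array in place of a linked list are assumptions for illustration; Paint's tlb.c may be organized differently.

```c
#include <stddef.h>

#define SK_TLB_ENTRIES 96   /* default size given in the text */
#define SK_TLB_READ    1u   /* hypothetical permission encodings */
#define SK_TLB_WRITE   2u

/* Reduced page table entry: tlbvalid is 0 when the page is not resident. */
typedef struct {
    unsigned long vframe;
    unsigned tlbvalid;
} sk_pte_t;

typedef struct {
    sk_pte_t *slot[SK_TLB_ENTRIES];  /* circular list of resident entries */
    int head;                        /* index of the oldest entry */
    int count;
} sk_tlb_t;

/* Returns 1 if the access may proceed, 0 if a TLB miss would be raised. */
int sk_tlb_check(const sk_pte_t *pte, int is_write)
{
    unsigned need = is_write ? SK_TLB_WRITE : SK_TLB_READ;
    return (pte->tlbvalid & need) != 0;
}

/*
 * Enter a translation with the given permissions. When the TLB is full the
 * oldest entry is evicted and returned; otherwise NULL is returned.
 */
sk_pte_t *sk_tlb_insert(sk_tlb_t *tlb, sk_pte_t *pte, unsigned perms)
{
    sk_pte_t *victim = NULL;

    if (pte->tlbvalid != 0) {        /* already resident: update perms */
        pte->tlbvalid = perms;
        return NULL;
    }
    if (tlb->count == SK_TLB_ENTRIES) {
        victim = tlb->slot[tlb->head];
        victim->tlbvalid = 0;        /* no longer in the TLB */
        tlb->head = (tlb->head + 1) % SK_TLB_ENTRIES;
        tlb->count--;
    }
    tlb->slot[(tlb->head + tlb->count) % SK_TLB_ENTRIES] = pte;
    tlb->count++;
    pte->tlbvalid = perms;
    return victim;
}
```

In this sketch a failed sk_tlb_check() corresponds to the point where the simulator would raise the TLB miss trap, and sk_tlb_insert() to the effect of the kernel's TLB insertion instructions.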
The following subsections describe the specifics of address translation.
3.1 Address Space Organization
Figure 5 shows Paint’s address space organization. Process virtual addresses are mapped to both
a processor physical address, and to a Paint physical address. The processor physical address is
assigned by the simulated kernel. The purpose of this address is to provide realistic processor
physical addresses to simulation modules such as the cache or memory bus. Processor physical
addresses are supplied by the kernel with PA-RISC TLB insertion instructions. When the TLB is
turned off, the processor physical addresses are always set to the Paint physical address. The page
table entry (figure 6) includes several TLB bits that indicate if the page table entry is currently in
the TLB, and the type of access permissions (read or read/write) that the translation was inserted
with.
Paint physical addresses are locations inside of the Paint program where simulated program
data is actually stored. These addresses are known only to Paint, and are assigned when a memory
reference touches a page for the first time. It is not until this point that a page in Paint’s program
space is allocated and the page table entry created (see Figure 6). This lazy allocation of pages
allows larger simulations since it is often the case that programs never reference many of their
pages. This arrangement allows programs to grow more dynamically as well. Like a real kernel,
program segments are assigned a maximum size (16 megabytes for data, 2 megabytes for stack),
and a vector of page table entry pointers for each possible page is created. A program can grow,
lazily allocating pages until it reaches that maximum. The overhead is small, given that there
are only 4096 page table entry pointers for a 16 megabyte segment, or 16K bytes of storage per
segment. Both maximum values can be overridden using the parameter file entries MAX_STACK_SIZE
and MAX_DATA_SIZE. Several other fields in the page table entry should be noted:
type     The type of page, currently either a normal data page or a shared memory page.
         This information is passed to the backend.
dealloc  Flag to indicate whether the underlying physical page should be deallocated
         when the page table is reclaimed. This is used when page table entries share a
         common physical page (as with shared memory segments).
hits     Cache hit information. Used in the frontend to avoid calls to the backend cache
         module.
Figure 5: Paint Address Space
    typedef struct {
        u_long pframe   : 20,
               type     : 12;
        u_long vframe   : 20,
               pbits    : 12;
        u_long fframe   : 20,
               dealloc  : 1,
               tlbvalid : 3,
               reserve  : 8;
        ptcline_t hits[PTCSIZE];
    } ptable_entry_t, pte_t;
Figure 6: Page Table Data Structure
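The lazy allocation scheme of section 3.1 can be sketched as a per-segment vector of page table entry pointers that is filled in on first touch. This is an illustration only: the lz_ names and structures are hypothetical, and the 4096 byte page size is inferred from the text's figure of 4096 entries for a 16 megabyte segment. With 4 byte pointers that vector occupies 4096 x 4 = 16K bytes, matching the overhead quoted above; on a host with 8 byte pointers it would be 32K bytes.

```c
#include <stdlib.h>

#define LZ_PAGE_SIZE 4096UL                  /* 16 MB / 4096 entries */
#define LZ_MAX_DATA  (16UL * 1024 * 1024)    /* default data segment limit */

/* Reduced page table entry: only the simulator-side page is tracked. */
typedef struct {
    unsigned char *page;
} lz_pte_t;

typedef struct {
    unsigned long base;       /* first virtual address of the segment */
    unsigned long max_size;   /* maximum segment size in bytes */
    lz_pte_t **ptes;          /* one pointer per possible page */
    unsigned long allocated;  /* pages actually allocated so far */
} lz_segment_t;

/* Create the (initially empty) pointer vector. Returns 0 on success. */
int lz_segment_init(lz_segment_t *seg, unsigned long base,
                    unsigned long max_size)
{
    seg->base = base;
    seg->max_size = max_size;
    seg->allocated = 0;
    seg->ptes = calloc(max_size / LZ_PAGE_SIZE, sizeof *seg->ptes);
    return seg->ptes != NULL ? 0 : -1;
}

/*
 * Translate a virtual address to a simulator-side ("Paint physical")
 * location, allocating the page on first touch. Returns NULL for an
 * address outside the segment or if allocation fails.
 */
unsigned char *lz_translate(lz_segment_t *seg, unsigned long vaddr)
{
    unsigned long off, idx;

    if (vaddr < seg->base || vaddr - seg->base >= seg->max_size)
        return NULL;                          /* illegal virtual address */
    off = vaddr - seg->base;
    idx = off / LZ_PAGE_SIZE;
    if (seg->ptes[idx] == NULL) {             /* first reference to page */
        lz_pte_t *pte = malloc(sizeof *pte);
        if (pte == NULL)
            return NULL;
        pte->page = calloc(1, LZ_PAGE_SIZE);
        if (pte->page == NULL) {
            free(pte);
            return NULL;
        }
        seg->ptes[idx] = pte;
        seg->allocated++;
    }
    return seg->ptes[idx]->page + off % LZ_PAGE_SIZE;
}
```

The bounds check is what gives address space protection in this sketch: a stray virtual address yields NULL instead of silently landing in another program's data or in the simulator itself.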
3.2 Shared Memory Support
Shared memory support is provided in Paint through the use of the page table system. The virtual
address range starting at 0xC0000000 is defined to be the shared memory address space. Each
processor has a single set of page table entries for the segment, and they are shared among all of
the processes on a processor. The underlying Paint physical pages are shared among all of the page
table entries on all of the processors. Thus, memory accesses on different processors refer to the
same memory location since they share a single page. In the Paint frontend, the only difference
when generating the memory event is that it is flagged as being to the shared memory segment.
Any special handling is expected to be done in the backend by the cache module.
3.3 Fork and Exec
Paint provides the support necessary for both of the UNIX system calls, fork and exec. When a
user process executes a fork or exec system call, the kernel does whatever bookkeeping it requires,
and then passes the call on to Paint itself. In the case of fork, Paint then duplicates both the process
page tables and the contents of the pages. The virtual page addresses are the same in the child’s
version of the page table entries, but there are new Paint physical addresses for each duplicated
page of data. Pages that had not been touched in the parent, and thus were not allocated, are left
unallocated in the child. All of the child’s page table entries are marked as not being in the TLB, and
the processor physical addresses are cleared. When the process eventually runs, normal TLB misses
will provide the new processor physical addresses. This mimics the operation of a real kernel in
which fork duplicates exactly the virtual address range, but maps those virtual addresses to a new
set of physical pages. For exec, Paint first reclaims all of the page table entries and pages, and
then loads the new program. A new set of page table entries is created, and the initialized data is
loaded. All other pages (bss, heap, stack) are allocated lazily as the program references them.
As can be seen, much of the support for fork and exec is contained within Paint itself. The
kernel includes its own support, but is much simpler than a production kernel would be. The bulk
of the machine dependent virtual memory support is contained in “vm.c” in the kernel.
3.4 Cache Support
The Paint frontend includes several extensions that allow it to optimize memory events. The first 
is called fasthits mode, and is used to decrease the number of memory references that result in 
backend events. Decreasing the number of backend events improves overall simulator performance, 
since suspending the instruction loop and invoking the backend is a very costly operation. When 
fasthits mode is turned on (using the PT_fasthits parameter field), the frontend consults the hits 
field of the page table entry to determine if the cache line being accessed is currently in the cache. 
If it is, no backend event is generated, and instruction execution continues immediately. Execution 
will continue until a maximum threshold of sequential instructions is reached, at which time a 
rescheduling operation is performed so that a new task may run. The threshold defaults to 50 
instructions, and can be set using the PT_fastcount parameter value.
The backend cache module is responsible for telling the frontend which lines are currently in 
the cache. There are two functions provided to the backend, one to indicate that a line has been 
inserted into the cache and another to indicate that a line has been evicted from the cache. In 
addition, the backend provides a function to the frontend so that the frontend can signal when a 
fasthit dirties a cache line. Thus, the frontend and the backend co-manage this state information 
as the simulation proceeds. The prototypes for the two functions called by the backend are:
ptable_cache_validate(int procid, int spaceid, unsigned long vaddr);
ptable_cache_invalidate(int procid, int spaceid, unsigned long vaddr);
The prototype for the function called by the frontend is:
flc_changestate_dirty(int procid, int spaceid, unsigned long vaddr);
For each routine, procid is the processor number, spaceid is the PA-RISC space identifier for 
the access, and vaddr is the virtual address of the line being accessed.
The second optimization mode provided by the frontend is called fastmisses mode, and is 
controlled by the PT_fastmisses parameter value. Fastmisses mode requires that fasthits mode be 
turned on. When fastmisses is on, no memory events (except those to I/O space locations) are 
generated. Instead, the frontend calls a function in the backend cache module to indicate that a 
line has been accessed. This allows the cache to be warmed up with the proper data, but without 
the expense of going to the backend. This mode is most useful during startup and initialization 
phases where the speed of the simulation is more important than accuracy. The function prototype 
provided by the backend is:
flc_fastcache_insert(int procid, int spaceid, int pid,
unsigned long vaddr, unsigned long paddr, int rw);
In order for this mode to be useful, it is necessary to provide a mechanism to turn it off at some 
point during the simulation, switching to the more accurate cache model. A simulated program 
level function call is provided that can be used in either the kernel or a user program to turn 
fastmisses mode on or off. This function is trapped by Paint itself:
fastmissmode(int onoff);
4 Stall On Use, Asynchronous Writes, I/O Space
This section describes the changes necessary to support stall on use (SOU), asynchronous writes, 
and I/O space access. The PA-RISC cache model employs all of these features, so supporting them 
was essential for realistic simulations in the Avalanche project. Stall on use allows the processor to 
proceed after a load operation, until the target of the load is referenced in a subsequent instruction. 
Only then must the processor stall until the load is complete. Asynchronous writes allow the 
processor to proceed immediately after a store. In this case, the cache stalls the processor when 
there are no more slots in which to hold the pending store. I/O space memory references are 
required to access Avalanche devices that are mapped into regions of the processor’s address space.
Supporting these new features required changes to the frontend and to the interface between 
the frontend and the backend.
4.1 Valid Registers
Stall on use support requires that the frontend know which registers are referenced in each 
instruction. For non-memory instructions, the source registers must all be valid. The target registers 
must also be valid before allowing the operation to proceed, even though real hardware would not 
necessarily require it. This is to prevent prior loads to the same register, that have not completed 
yet, from subsequently overwriting the target. We do this as a simplification since it rarely happens 
that a register is destroyed before a previous load to that same register is used (and thus, would 
stall). For memory instructions, the base and index registers, as well as the target registers, must 
all be valid before the instruction can proceed.
In order to determine which registers are referenced by each instruction, and which registers are 
currently valid during execution, a validregs data structure was added to the icode_t structure. 
When a program binary is loaded, the validregs structure for each instruction is initialized with 
a list of registers that are referenced in that instruction. A similar structure was also added to 
the processor structure. As execution proceeds, the current set of validregs for the processor is 
compared against those referenced in each instruction. If all of the referenced registers are valid, 
the instruction executes normally. If there are invalid registers, the processor is stalled until the 
backend indicates to the frontend that execution can continue.
Load instructions are a special case since they modify the current set of valid registers. The 
target register of the load is made invalid. At some later time, the backend will indicate that the 
load has completed, and that the register can be made valid again. If the processor had been 
stalled because of a subsequent reference to that register, it is restarted. This state change is 
communicated to the frontend using the sou_load function described below.
The validregs information for each instruction is initialized using the SETVALIDREG macros (there 
are variants for integer, floating point, and double registers). See the instruction decode functions 
in “text.c” for an example.
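The valid register check can be pictured as a bitmask comparison. The sketch below is an illustration under stated assumptions: it models only 32 integer registers, and the structure and function names are invented for this example rather than taken from the validregs structure or the SETVALIDREG macros.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 32

/* Hypothetical per-instruction register set (one bit per register). */
typedef struct { uint32_t refs; } inst_regs_t;

/* Hypothetical per-processor state. */
typedef struct {
    uint32_t valid;    /* registers whose contents are currently valid */
    uint32_t waiting;  /* registers needed by the stalled instruction */
    int stalled;
} proc_regs_t;

static uint32_t reg_bit(int r)
{
    assert(r >= 0 && r < NREGS);
    return (uint32_t)1 << r;
}

/* Returns 1 if the instruction may execute; otherwise stalls and returns 0. */
static int can_issue(proc_regs_t *p, const inst_regs_t *in)
{
    if ((in->refs & ~p->valid) != 0) {
        p->waiting = in->refs;
        p->stalled = 1;
        return 0;
    }
    return 1;
}

/* A load makes its target register invalid until the backend completes it. */
static void issue_load(proc_regs_t *p, int target)
{
    p->valid &= ~reg_bit(target);
}

/* Load completion: revalidate the register, restart the processor if possible. */
static void load_done(proc_regs_t *p, int target)
{
    p->valid |= reg_bit(target);
    if (p->stalled && (p->waiting & ~p->valid) == 0)
        p->stalled = 0;
}
```

An instruction that references the target of an outstanding load stalls in can_issue(), and is released only when load_done() has made every register it needs valid again.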
4.2 Asynchronous Events
The original interface between the frontend and the backend was through the use of event_t 
structures. This structure carried all of the information needed by the memory system module to 
carry out the operation. This interface is synchronous in nature; the backend must either stall the 
processor immediately, or copy all of the information out of the event structure before allowing the 
processor to continue. This is because there is just one event structure per instruction execution 
task, which is reused for all events that are sent to the backend. All of the information in the event 
structure must be captured before the frontend is allowed to continue, or it will be lost when the 
next event is reached. In order to support a more asynchronous interface between the frontend 
and the backend, a new structure, rw_trans_t, was introduced (see Figure 7). This new structure 
saves the backend from having to copy out the event information.
typedef struct {
    unsigned long value[2];
    unsigned long *paddr;
    unsigned long vaddr;
    short regnum;
    short type;
    short spaceid;
    short vproc;
    short pid;
} rw_trans_t;
Figure 7: rw_trans_t Data Structure
A brief description of the fields follows:
value  The value being written to memory in a store instruction, or the value being 
loaded in an I/O space load. The value is captured since the operation may not 
proceed until a later time, and the data might be altered before then by other 
tasks. The field is large enough to support double word operations.
paddr  The Mint “physical” address of the memory location the data is being written 
to or read from.
vaddr  The program “virtual” address that was referenced.
regnum  The register number that is the target of a load instruction.
type  Various type bits to indicate such things as the size of the operation, whether 
it is to the shared address space, etc.
spaceid  A PA-RISC implementation specific space identifier.
vproc  The processor number.
pid  The global process identifier.
The above structure is created by the frontend when calling any of the backend functions 
sim_read(), sim_write(), sim_flush(), sim_purge(), or sim_sync(). A pointer to the structure 
is placed in the pending slot of the event_t structure. The backend function can copy that pointer, 
but if it plans to let the processor continue asynchronously, it must set the pointer to NULL before 
returning T_ADVANCE. This tells the frontend to create a new structure at the next event. This is an 
optimization that prevents the creation of a new data structure on each event unless the backend 
captures the previous one. All other backend event functions use the original event_t structure 
interface as described in the Mint document.
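The ownership rule for the pending pointer can be sketched as follows. Only the rw_trans_t layout comes from Figure 7; event_stub_t, next_trans(), and capture_trans() are hypothetical names standing in for the relevant part of event_t and for frontend and backend code that is not shown in this document.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned long value[2];
    unsigned long *paddr;
    unsigned long vaddr;
    short regnum, type, spaceid, vproc, pid;
} rw_trans_t;

/* Stand-in for the one slot of event_t that matters here. */
typedef struct { rw_trans_t *pending; } event_stub_t;

/* Frontend side: reuse the previous structure unless the backend took it. */
static rw_trans_t *next_trans(event_stub_t *ev, unsigned long vaddr)
{
    assert(ev != NULL);
    if (ev->pending == NULL) {
        ev->pending = calloc(1, sizeof *ev->pending);
        if (ev->pending == NULL)
            return NULL;
    }
    ev->pending->vaddr = vaddr;
    return ev->pending;
}

/* Backend side: take ownership so the processor can continue asynchronously. */
static rw_trans_t *capture_trans(event_stub_t *ev)
{
    rw_trans_t *t = ev->pending;
    ev->pending = NULL;  /* tells the frontend to allocate a fresh structure */
    return t;
}
```

If the backend never captures the structure, the same allocation is reused for every event, so allocation happens only when the backend actually holds a transaction.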
Once the backend determines that a load or store operation can proceed, and the contents 
of the registers or memory can be changed, it will call either sou_load (for load instructions) or 
sou_store (for store instructions) to handle the actual operation. This prevents the backend from 
having to know numerous internal details of the frontend.
4.2.1 sou_load()
sou_load(rw_trans_t *ptrans, mint_time_t simtime);
sou_load loads the contents of memory into a register. ptrans points to a rw_trans_t structure 
captured by sim_read at some time in the past. The register is marked valid and the data is 
transferred, and if the processor is stalled, it is restarted. simtime is the time at which the processor 
should be restarted. A side effect of sou_load is to free the structure pointed to by ptrans.
4.2.2 sou_store()
sou_store(rw_trans_t *ptrans);
sou_store stores the contents of a register to memory. ptrans points to a rw_trans_t structure 
captured by sim_write at some time in the past. The value of the register at the time the store 
was done is contained within the rw_trans_t structure; it is this value that is stored to memory. A 
side effect of sou_store is to free the structure pointed to by ptrans.
4.2.3 Simple Read Example
The following example is a simple demonstration of SOU. It is by no means complete, and is not 
functionally correct in the presence of asynchronous writes.
#include "mint.h"

int read_done(task_ptr ptask);

int
sim_read(task_ptr ptask)
{
    /* capture the transaction; clearing pending makes the frontend create a new one */
    rw_trans_t *ptrans = ptask->pevent->pending;
    task_ptr pnewtask;

    ptask->pevent->pending = NULL;

    /* schedule the read_done() function 10 cycles from now */
    pnewtask = sched_task(ptask, read_done, ptask->time + 10);
    pnewtask->uptr = ptrans;

    /* return T_ADVANCE so that execution continues */
    return T_ADVANCE;
}

int
read_done(task_ptr ptask)
{
    rw_trans_t *ptrans = ptask->uptr;

    /* Complete the load operation. */
    sou_load(ptrans, ptask->time);
    return T_FREE;
}
4.3 I/O Space Access
Paint includes extensions to the frontend that allow I/O space references: memory addresses that 
are not real memory locations (from the processor’s point of view), but refer to I/O device registers 
that are mapped into the processor’s address space. An I/O space address is any address in which 
the upper 4 bits of the address are set to 1. When the frontend is given such an address, it sets 
the processor physical and the Paint physical address equal to the virtual address. As far as the 
frontend is concerned, this is the only difference at the time the event is generated. The event is 
then passed to the backend, where it is the responsibility of the cache module to determine what to 
do, possibly handing the event off to another simulation module (say, a network module). At some 
point the I/O space device will complete operation on the event. In the case of an asynchronous 
write, it must free the rw_trans_t structure that was handed to it by the cache module. This 
is done with the free_ptrans() routine. For a load, frontend execution must be restarted by 
calling sou_load(). In both cases, the data to be read or written is contained in the value field 
of the rw_trans_t structure. For load instructions, the I/O space device module is responsible for 
placing the data in the value field before calling sou_load(). I/O space devices may operate either 
asynchronously or synchronously; the choice is up to the simulation module writer and the cache 
module writer.
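The I/O space test amounts to checking the top four address bits. The sketch below assumes 32-bit virtual addresses and represents both physical addresses as plain integers for simplicity; the function and structure names are hypothetical and are not part of the Paint frontend.

```c
#include <assert.h>
#include <stdint.h>

/* An address is in I/O space when its upper 4 bits are all 1
   (assuming 32-bit virtual addresses). */
static int is_io_space(uint32_t vaddr)
{
    return (vaddr >> 28) == 0xFu;
}

/* Hypothetical result of address translation for a memory event. */
typedef struct {
    uint32_t proc_paddr;   /* processor physical address */
    uint32_t paint_paddr;  /* Paint physical address (an integer here) */
} xlate_t;

/* For I/O space, both physical addresses equal the virtual address. */
static xlate_t translate_io(uint32_t vaddr)
{
    xlate_t x;
    assert(is_io_space(vaddr));
    x.proc_paddr = vaddr;
    x.paint_paddr = vaddr;
    return x;
}
```

Under this assumption the I/O region is 0xF0000000 through 0xFFFFFFFF, and no page table lookup is needed for addresses inside it.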
5 Machine Dependent Interfaces
In order to run a realistic kernel and user binaries, it is necessary to support several machine 
dependent interfaces. System calls, traps, and interrupts are all supported in Paint. The next few 
sections describe these interfaces and how they are supported in both the kernel and Paint.
5.1 System Calls
To avoid special compilation of user programs, Paint must support the system call interface 
contained within the C Library. Under BSD and HPUX, system calls are invoked by issuing a branch 
and link external (ble) instruction to a fixed location in the virtual address space. This special 
location is called the gateway page. A production kernel will trap this branch attempt, and then 
transfer control to a fixed routine, while raising the privilege level and switching the virtual address 
context to that of the kernel. To return from the system call, the kernel will issue a branch external 
(be) instruction, which transfers control back to the user program, while lowering the privilege level 
and switching the virtual context back to that of the user program.
The requirements for Paint are somewhat less, since the notion of privilege level does not 
need to be strictly enforced. Only the transfer of control and the virtual address switch need to be 
implemented. Paint does this by trapping all ble and be instructions and examining the target 
location of the branch. For ble, the event_ble routine looks at the target address, and if it is within 
the first page starting at virtual address 0xC0000000, it is assumed to be a system call. Paint then 
transfers control to a known text address (marked by the symbol syscall_trap) in the kernel by 
locating the proper icode for that instruction, and returning it from event_ble. Execution continues 
with that instruction. When the kernel initiates a return from the system call, a similar sequence of 
events occurs in the event_be routine. The target address of the branch is checked to ensure that 
it is in the user’s program space, after which the proper icode is located and returned. In both 
cases, the privilege level of the processor is changed to indicate that a switch occurred, but there 
is no check made when executing instructions; a user program may execute privileged instructions 
if it wants. So far, this has not presented any problems.
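The gateway page test performed for ble can be sketched as a simple range check. This is an illustration only: the 4K page size is an assumption, branch targets are modelled as 32-bit addresses rather than icode pointers, and is_syscall_target() and ble_target() are hypothetical names, not the event_ble routine itself.

```c
#include <assert.h>
#include <stdint.h>

#define GATEWAY_BASE 0xC0000000u
#define PAGE_SIZE    4096u        /* assumed 4K pages */

/* Nonzero if a ble target lies in the first page at GATEWAY_BASE. */
static int is_syscall_target(uint32_t target)
{
    return target >= GATEWAY_BASE && target - GATEWAY_BASE < PAGE_SIZE;
}

/* Pick the address execution continues at after a ble.
   syscall_trap_addr stands for the kernel text address of syscall_trap. */
static uint32_t ble_target(uint32_t target, uint32_t syscall_trap_addr)
{
    /* the kernel entry point must not itself look like a system call */
    assert(!is_syscall_target(syscall_trap_addr));
    return is_syscall_target(target) ? syscall_trap_addr : target;
}
```

Any branch into the gateway page is redirected to the kernel entry point, while every other ble proceeds to its original target.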
5.2 Interrupts
Processor interruptions are anomalies that occur during instruction execution which cause the 
flow of control to switch to a known location in the kernel. External interrupts are processor 
interruptions which are delivered asynchronously with respect to the instruction stream. Clock 
interrupts and device interrupts are two types of asynchronous interrupts that are supported by 
Paint and the kernel. Paint provides an interface that can be used by simulation modules (say, 
a clock device) to generate an interrupt for the processor. For example, Paint generates a clock 
interrupt every 100,000 cycles by scheduling a task to call the following function:
vproc_interrupt(int procid, int interrupt);
where procid is the processor number, and interrupt is a number in the range 0 to 31 (clock 
interrupts are always delivered as number 31). On the PA-RISC, all external interrupts are 
delivered to a single interrupt handling routine, while the actual interrupt is specified using the 
external interrupt request register (EIRR). The system priority level (SPL) is encoded in the external 
interrupt enable mask (EIEM) register. Interrupts are masked by the processor either by clearing 
bits in this mask, or by turning off all interrupts via the processor status word. As a simplification, 
both Paint and the kernel currently agree to use either an interrupts on or interrupts off policy; the
kernel is either masking all interrupts, or masking none of them. With so few devices generating 
interrupts, this is not expected to be a problem.
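The all-or-nothing masking policy can be illustrated with a small sketch. Everything in it is an assumption made for this example: vproc_stub_t and its functions are hypothetical, and the bit positions in the pending mask simply mirror the interrupt number and do not follow the EIRR bit numbering of real PA-RISC hardware.

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_INTERRUPT 31
#define CLOCK_PERIOD    100000UL  /* cycles between clock interrupts */

/* Hypothetical per-processor interrupt state. */
typedef struct {
    uint32_t pending;      /* one bit per posted interrupt number */
    int interrupts_on;     /* all-or-nothing mask */
} vproc_stub_t;

/* Post interrupt number n (0 to 31) for later delivery. */
static void post_interrupt(vproc_stub_t *p, int n)
{
    assert(n >= 0 && n <= 31);
    p->pending |= (uint32_t)1 << n;
}

/* Return the interrupts to deliver now, clearing them; 0 while masked. */
static uint32_t take_deliverable(vproc_stub_t *p)
{
    uint32_t d;
    if (!p->interrupts_on)
        return 0;
    d = p->pending;
    p->pending = 0;
    return d;
}

/* Post a clock interrupt on every multiple of CLOCK_PERIOD cycles. */
static void clock_tick(vproc_stub_t *p, unsigned long cycle)
{
    if (cycle != 0 && cycle % CLOCK_PERIOD == 0)
        post_interrupt(p, CLOCK_INTERRUPT);
}
```

Interrupts posted while the kernel has interrupts off stay pending and are delivered together once interrupts are turned back on.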
When an interrupt is delivered, control is transferred to the external interrupt handler routine 
in the kernel. This is an assembly language routine (derived from the PA-RISC version of Mach 
3.0) which saves the machine state before calling the higher level interrupt handling code. When 
the kernel returns from the interrupt, it will restore the machine state and issue an rfi instruction. 
The interrupt handling code is entirely realistic; only a few lines of code needed to be changed for 
it to run under Paint.
5.3 Traps
Traps are processor interruptions that occur synchronously with respect to the instruction stream. 
A data page fault that results in a TLB miss would be one such example. Traps are initiated by 
calling the vproc_trap() routine, which returns the icode of the first instruction of the proper 
trap handling routine in the kernel. This will be the next instruction to execute instead of the one 
that caused the trap. When the kernel is ready to continue the process that caused the trap, it will 
restore the machine state, and issue a return from interrupt (rfi) instruction. Typically, the kernel 
will restart the process at the instruction that caused the trap. To initiate a trap:
icode_ptr
vproc_trap(thread_ptr pthread,
icode_ptr picode, int trapnum, rw_trans_t *ptrans);
The currently supported traps are:
I_DPGFAULT  Data page fault. The requested virtual address translation is not present in the 
TLB.
I_LPRIVXFER  Lower privilege transfer trap. An instruction is about to be executed at a lower 
privilege level than the current instruction, and the PSW L-bit is set. This is 
used by the kernel to deliver asynchronous transfer traps (AST) in the interrupt 
handler.
I_DMEMJICC  Data memory access trap. The requested virtual address translation is in the 
TLB, but the access permission bits are incorrect for the type of access.
6 Specifying Simulation Parameters
One of the problems we encountered was command line option explosion. Many of the simulation 
modules needed their own set of command line options, and it became confusing and difficult to 
maintain since only a single module can use an option. To address this problem, we added a 
parameter file that can store name/value pairs, much like the Xdefaults file in the X11 window 
system (although much simpler). The simulator can call the get_parameter() routine to get the 
value of a particular parameter.
The default file name for the parameter file is “./mint_params.” A different filename can be 
specified as a command line option to the simulator (see Section 7). The format of a parameter 
file is very simple. Blank lines and lines beginning with a “#” are ignored. The first field on a 
non-blank line is the name of the parameter, and the second field is the value. For example:
# This is a sample parameter file.
#
Useful_parameter_1 0
Useful_parameter_2 1
A simulation module can then access the parameter values using the following function:
get_parameter(char *parameter_name, void *value, int type)
The string pointer parameter_name is the name used to match on in the parameter file. The 
match is case sensitive. value is a pointer to storage large enough to hold the parameter value. If 
no matching name is found, value is left unchanged. The type argument specifies what type of 
parameter value is expected. The three possibilities (defined in mint.h) are:
PARAM_INT  The value is an integer.
PARAM_FLOAT  The value is a float.
PARAM_STRING  The value is a string. The storage should be large enough to hold the largest 
string expected.
A simple code fragment tha t requests a parameter value follows:
{
    int Useful_parameter = 0;

    get_parameter("Useful_parameter_1", &Useful_parameter, PARAM_INT);
    if (Useful_parameter)
        do_something();
}
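To make the parameter file format concrete, the following sketch parses lines of the form described above. It is not the get_parameter() implementation, whose source is not shown here; parse_param_line() and lookup_int() are hypothetical helpers, the buffer sizes are arbitrary, and skipping leading white space before a “#” is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PARAM_NAME_MAX  64
#define PARAM_VALUE_MAX 128

/* Parse one line. Returns 1 and fills name and value for a "name value"
   line; returns 0 for blank lines, comments, and incomplete lines. */
static int parse_param_line(const char *line,
                            char name[PARAM_NAME_MAX],
                            char value[PARAM_VALUE_MAX])
{
    assert(line != NULL);
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '#')
        return 0;
    return sscanf(line, "%63s %127s", name, value) == 2;
}

/* Look up an integer parameter among n lines (case sensitive match).
   Returns 1 on success; *out is left unchanged if nothing matches. */
static int lookup_int(const char *const *lines, size_t n,
                      const char *wanted, int *out)
{
    char name[PARAM_NAME_MAX], value[PARAM_VALUE_MAX];
    size_t i;

    for (i = 0; i < n; i++) {
        if (parse_param_line(lines[i], name, value) &&
            strcmp(name, wanted) == 0)
            return sscanf(value, "%d", out) == 1;
    }
    return 0;
}
```

As with get_parameter(), a caller initializes the variable to its default first, so a missing parameter simply leaves the default in place.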
7 Command Line Arguments
This section describes the command line options that have been changed or added to Paint.
-z  Turn on virtual processor mode. By default, Paint runs in SPMD mode for 
backwards compatibility and the virtual processor functions described in 
Section 2 are disabled.
-n count  Specify the maximum number of virtual processors when MPMD Paint is 
enabled. The default value is one. The kernel may create this many processors 
(using the newvproc() interface function). Attempts to create more will result 
in a fatal error.
-I  Turn on Stall On Use mode. See Section 4 for a description of SOU mode. When 
SOU mode is off, the backend should treat load instructions synchronously, 
returning T_ADVANCE only when the load is complete and execution can 
proceed normally. The default value is off.
-k stacksize  This option is ignored in MPMD Paint. Stack segments grow on demand.
In SPMD mode, this option can be used only when stacks are in the shared 
memory space (this occurs when the sproc system call is used).
-m file  The file specifies a list of runtime parameters for the simulator. The default 
value is “mint_params.” See Section 6 for a description of the parameter file.
-h heapsize  This option applies only to the kernel heap size in MPMD Paint. The heap 
space for user programs grows on demand. See Section 3 for more details. The 
default value is 256 4K pages.
8 Simulator Support for the Kernel
There are several new support functions that are intercepted by the simulator.
8.1 newvproc()
Create a new simulated processor. This is functionally equivalent to fork() in that control returns 
to the parent and the child at the point following the function call. newvproc() returns 0 in the 
child, and non-zero in the parent. This function is used by the kernel startup code to duplicate 
itself onto as many processors as were requested when Paint was invoked (using the -n option; see 
Section 7).
8.2 getvproc()
Return the current simulated processor number. The value is an integer in the range 0 to 
(Number_of_Nodes - 1).
8.3 getmaxnodes()
Return the total number of processors in the simulation. This is set using the -n option to the 
simulator, and is made available to the kernel and to user programs with this function.
9 Miscellaneous Notes
9.1 Building Paint
Please consult the README file in the top level directory for instructions on how to build Paint, 
the kernel, and the support programs.
9.2 An Example Kernel
The kernel that we run is not a production kernel, but rather a subset of the BSD 4.4 kernel (Lite-2 
release) that can handle system call trapping, asynchronous interrupts, synchronous traps, and 
signals, and that provides additional functionality required by the Avalanche Scalable Computing 
Project. This functionality includes message passing and distributed shared memory system calls 
and interrupt handlers. This kernel is provided in the distribution, so please consult the README 
file in that directory for instructions on how to build and run the sample kernel.
9.3 Who To Contact
If you have questions or comments, please send email to avalanche@jensen.cs.utah.edu. There are 
several people responsible for the simulator, so this is your best chance to get a response.
The Avalanche Project requests that users of this software return any improvements they make 
to avalanche@jensen.cs.utah.edu and grant the Avalanche Project redistribution rights.
9.4 Credits
Thanks to Jack Veenstra (veenstra@itagain.mti.sgi.com) for developing the Mint simulator, upon 
which Paint is based.
References
[1] Hewlett-Packard Co. PA-RISC 1.1 Architecture and Instruction Set Reference Manual, 
February 1994.
[2] Swanson, M., Kuramkote, R., Tateyama, T., and Stoller, L. Message Passing Support 
in the Avalanche Widget. Tech. Rep. UUCS-96-002, University of Utah - Computer Science 
Department, March 1996.
[3] Veenstra, J. Mint Tutorial and User Manual. Tech. Rep. 452, University of Rochester 
Computer Science Department, May 1993.
