In this paper, we present a relatively primitive execution model for fine-grain parallelism, in which all synchronization, scheduling, and storage management is explicit and under compiler control. This is defined by a threaded abstract machine (TAM) with a multilevel scheduling hierarchy. Considerable temporal locality of logically related threads is demonstrated, providing an avenue for effective register use under quasidynamic scheduling.
Introduction
Multithreading at the instruction level may provide the key to general-purpose parallel computing [26], because it allows the processor to tolerate long, unpredictable communication latency [2, 4, 17, 24, 29]. In addition, this level of multithreading is required to support certain modern parallel programming languages [28], such as Id [20] and Multilisp [18], and extensions of more conventional languages with synchronizing data structures, e.g., I-structures [6].
On the other hand, asynchronous transfer of control (context switching) is notoriously expensive on current machines, leading many researchers to examine asynchronous parallel execution models through the study of real machines [11, 13, 15, 22, 25, 27], paper architectures [1, 5, 14, 16, 19], and abstract machines [21]. In all of these proposals, the scheduling of threads is viewed as a property of the machine, invisible to the compiler. While we share the view that asynchronous events are the rule, not the exception, in large-scale multiprocessors, we claim that relieving the compiler of responsibility for scheduling low-level program entities squanders critical processor resources, such as high-speed register storage, and places unreasonable demands on the hardware, such as maintaining scheduling queues of arbitrary size. By retaining some of this control, the compiler can optimize the use of processor resources for the expected case, rather than the worst case, and exploit considerable inter-thread execution locality. Thus, tolerance to latency and inexpensive synchronization require that the compiler adopt a suitable program representation, but this need not be manifest in the processor architecture.
To investigate this view, we have formulated a relatively primitive threaded abstract machine (TAM) in which processor resources and thread scheduling are explicit at the instruction level and all storage management is the responsibility of the language system, not the machine. We have retargeted the compiler for Id, an extended functional language that relies on dynamic scheduling, to generate TL0 (Thread Language Zero), a first-cut instruction set for this machine, rather than dataflow graphs. Although TAM could be realized directly in hardware, it is intended as a vehicle for studying what architectural support is most important for full-scale parallel programs on large parallel machines. To this end, we have developed a versatile translator for TL0 to a variety of existing parallel and sequential machines.
Preliminary measurements indicate that this implementation of Id yields performance between C and Lisp for comparable programs on the same uniprocessor. This dispels the view that the implementation of languages with fine-grain synchronization on conventional architectures must be essentially interpretive. Second, dynamic instruction counts under this execution model are comparable to dataflow models, which support synchronization and generation of parallel threads as part of every instruction. Third, the locality among execution threads that are not, or cannot be, statically ordered appears to be substantial. Finally, a large fraction of the potential synchronization events can be compiled away or synthesized cheaply with little or no architectural support. The key architectural challenge is an intimate coupling of processor and network, with much of the message decode task delegated to the compiler.
The following describes TAM and its current realization in TL0. Section 2 outlines the basic structure and scheduling mechanism supported by our model and draws comparisons with proposed and existing threaded execution models. Section 3 provides preliminary performance measurements.
Appendix A describes our threaded machine language.
2 The TAM Execution Model
2.1 Storage model and basic structure
TAM recognizes three major storage resources (code-blocks, frames, and structures) and the existence of critical processor resources, such as registers. A program is represented by a collection of re-entrant code-blocks, corresponding roughly to individual functions or loop bodies in the high-level program text. A code-block comprises a collection of threads; each thread is a sequence of instructions. Invoking a code-block involves allocating a frame (much like a conventional call frame), depositing argument values into locations within the frame, and enabling threads within the code-block for execution. Figure 1 illustrates the relationship between the code-block and the frame. Instructions may refer to registers and to slots in the current frame; the compiler statically determines the frame size for each code-block and is responsible for correctly using slots and registers under all possible dynamic thread orderings. (This is somewhat more complex than traditional register allocation via graph coloring [7].) The compiler also reserves a portion of the frame as a continuation vector, used at runtime to hold pointers to enabled threads. The continuation vector must be large enough to describe the concurrently enabled threads for a code-block. The global scheduling pool is the set of frames that contain enabled threads.
Structures are heap-allocated data objects, accessed through split-phase fetches and stores. That is, the instruction that issues a fetch request does not wait for the data value to be returned; instead, the response initiates a new execution thread [4, 6].
This allows the processor to be well utilized while remote requests are outstanding. In addition, a synchronization event may take place at the site of the accessed object, so the request latency is unbounded.
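The split-phase discipline can be illustrated with a small sketch. It is written in Python rather than TL0, and every name in it (IStructure, fetch, store, the network queue) is a hypothetical stand-in, not part of TAM. It assumes write-once elements in the style of I-structures: a fetch of an empty element is deferred at the structure and answered when the element is written, which is why the request latency has no fixed bound.

```python
from collections import deque


class IStructure:
    """Write-once cells; a read of an empty cell is deferred until the write."""

    def __init__(self, size):
        self.full = [False] * size                  # presence flag per element
        self.value = [None] * size
        self.deferred = [[] for _ in range(size)]   # reply handlers waiting per element

    def fetch(self, index, reply, network):
        # Split-phase: the requester never waits here. The reply is either
        # queued immediately or parked at the element until it is written.
        if self.full[index]:
            network.append((reply, self.value[index]))
        else:
            self.deferred[index].append(reply)

    def store(self, index, value, network):
        if self.full[index]:
            raise ValueError("element written twice")
        self.full[index] = True
        self.value[index] = value
        for reply in self.deferred[index]:          # satisfy every deferred read
            network.append((reply, value))
        self.deferred[index].clear()


def demo():
    network = deque()   # assumed stand-in for replies in flight
    log = []
    s = IStructure(2)
    s.fetch(0, lambda v: log.append(("got", v)), network)  # issued before the write
    log.append("requester continues")                      # processor stays busy
    s.store(0, 42, network)
    while network:                                         # each reply starts new work
        reply, v = network.popleft()
        reply(v)
    return log
```

In this sketch the requester logs its own progress before the value arrives, and the reply handler runs only after the store, mirroring the way a response initiates a new thread instead of resuming a stalled one.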
Activations
An executing code-block may invoke several code-blocks concurrently, since the caller is not suspended as in a conventional sequential language. Therefore, the set of frames in existence at any time forms a tree, the activation tree, rather than a stack, reflecting the dynamic call structure (see Figure 2). We refer to a frame and the set of threads executed relative to the frame as an activation. The basics of this parallel call scenario are well described in the literature [5, 9, 21]. To allow greater parallelism and to support languages with non-strict function call semantics, the arguments to a code-block may be delivered asynchronously; each will initiate an execution thread within the code-block. An activation is enabled if its frame contains any enabled threads. At any time, a subset of the enabled activations may be resident on processors, as discussed below.
Threads
Threads come in two forms, synchronizing and nonsynchronizing. A synchronizing thread specifies a frame slot containing the entry count for the thread. Each fork to a synchronizing thread causes the entry count to be decremented, but the thread executes only when the count reaches zero. Synchronization occurs only at the start of a thread; once successfully initiated, a thread executes to completion. Fork operations may occur anywhere in a thread, causing additional threads to be enabled for execution. An enabled thread is identified by a continuation: its instruction pointer and its frame. Because the continuation vector is contained within the frame, a continuation is represented simply as a pointer to the first instruction of a thread. A thread ends with an explicit stop instruction, which causes another enabled thread to be initiated, i.e., removes an element from the current continuation vector and transfers control to it.
Conditional flow of execution is supported by switch, which forks one of two threads based on a boolean input value, and case, an indexed fork based on a small integer input. The compiler is responsible for establishing correct entry counts for synchronizing threads prior to any fork to the thread. This is facilitated by allowing a distinguished initialization thread in each code-block, which is the first thread executed in an activation of the code-block. One of the threads contains a release instruction that causes the frame to be reclaimed: the compiler ensures that this is the last instruction executed for the activation.
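The entry-count mechanism can be made concrete with a short sketch. It is Python, not TL0, and the names (Frame, fork, run, the slot name sum_count) are illustrative assumptions. Threads are modeled as functions that run to completion, a fork to a synchronizing thread decrements the count held in a frame slot, and the thread is enabled only when that count reaches zero.

```python
class Frame:
    def __init__(self):
        self.slots = {}    # frame slots, including entry counts
        self.cv = []       # continuation vector: enabled threads
        self.trace = []    # results recorded for inspection


def fork(frame, thread, count_slot=None):
    """Enable `thread`. A synchronizing thread names the slot holding its entry count."""
    if count_slot is not None:
        frame.slots[count_slot] -= 1
        if frame.slots[count_slot] != 0:
            return                     # other forks are still outstanding
    frame.cv.append(thread)


def run(frame, first):
    frame.cv.append(first)
    while frame.cv:                    # stop: take another enabled thread
        thread = frame.cv.pop()
        thread(frame)                  # a thread runs to completion
    return frame.trace


def t_sum(frame):                      # synchronizing thread, needs both operands
    frame.trace.append(frame.slots["a"] + frame.slots["b"])


def t_a(frame):
    frame.slots["a"] = 3
    fork(frame, t_sum, "sum_count")


def t_b(frame):
    frame.slots["b"] = 4
    fork(frame, t_sum, "sum_count")


def t_init(frame):                     # initialization thread sets entry counts first
    frame.slots["sum_count"] = 2
    fork(frame, t_a)
    fork(frame, t_b)


def demo():
    return run(Frame(), t_init)
```

Whichever of t_a and t_b happens to run first, t_sum executes exactly once and only after both operands are in the frame, because the count is established before any fork to it.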
Quanta
Given that the execution model supports a tree of activations, many of which may have several concurrently enabled threads, the fundamental concerns are where frames reside in the storage hierarchy, how the pool of continuations is represented, and how threads are scheduled. Surprisingly, these concerns have received little attention in the threaded execution models discussed in the literature; TAM was developed to address these issues directly. The key observation is that the activation tree and the continuation pool are typically quite large, except on toy programs. This has been demonstrated empirically for programs in Id [8], Sisal [23], and Multilisp [18]. Minimizing the activation tree size while exposing sufficient parallelism is an active area of research, but even with advances in this area we cannot expect the entire activation tree or the entire continuation pool to be maintained in high-speed processor storage. Therefore, the scheduling mechanism must recognize that only a subset of the activation frames is resident on a processor and that a large number of continuations will exist for non-resident frames.
The storage hierarchy is explicit in TAM. In addition, scheduling is explicit and reflects the storage hierarchy. In order to execute threads from an activation, the activation must be made resident. Only a limited number of activations may be resident. When an activation is made resident on a processor, it has access to processor registers. Furthermore, it remains resident and executing until no enabled threads for the activation remain. The set of threads executed during a single residency is called a quantum. Recognition of this intermediate level of scheduling is a major departure from dataflow-oriented execution models, such as ETS [22] and P-RISC [21], and is key to an efficient implementation on conventional machines.
A non-resident frame may accumulate several continuations, say as arguments are supplied, and when it becomes active all of these threads are executed, as well as any that they enable. Processor registers are essentially an extension to the frame with a lifetime of a single quantum. They may carry values between instructions within a thread or across threads in a quantum. All threads enabled by fork instructions are guaranteed to execute while the frame is resident. Therefore, the continuation vector is split into two parts: a remote continuation vector, used to hold continuations generated external to the activation, and a local continuation vector, used to hold internally generated continuations. The remote continuation vector is part of the frame whereas the local continuation vector should be viewed as a stack of continuation registers. Fork and stop translate into instruction pointer push and pop-jump respectively.
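A minimal sketch of a single quantum follows, again in Python with hypothetical names (Activation, run_quantum, regs). It assumes the remote continuation vector is a list in the frame and the local continuation vector is a plain stack, so that fork is a push and stop is a pop followed by a jump. The regs dictionary stands in for processor registers and is discarded when the quantum ends.

```python
class Activation:
    def __init__(self, name):
        self.name = name
        self.rcv = []      # remote continuation vector: part of the frame
        self.frame = {}    # frame slots


def run_quantum(act):
    """Run one residency of `act` and return the threads executed (the quantum)."""
    lcv = list(act.rcv)    # continuations accumulated while non-resident
    act.rcv.clear()
    regs = {}              # registers: lifetime of a single quantum
    executed = []

    def fork(thread):      # fork = push an instruction pointer
        lcv.append(thread)

    while lcv:             # stop = pop and jump
        thread = lcv.pop()
        executed.append(thread.__name__)
        thread(act, regs, fork)
    return executed


def recv_x(act, regs, fork):
    regs["x"] = act.frame["x"]     # load once; keep the value in a register
    fork(use_x)                    # guaranteed to run during this residency


def use_x(act, regs, fork):
    act.frame["y"] = regs["x"] * 2  # register value carried across threads


def demo():
    act = Activation("f")
    act.frame["x"] = 21
    act.rcv.append(recv_x)         # posted from outside, e.g. an argument arrival
    quantum = run_quantum(act)
    return quantum, act.frame["y"]
```

Because every thread pushed by fork is popped before run_quantum returns, use_x can rely on the register written by recv_x without going back to the frame.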
Quantum boundaries in TAM are visible to the compiler. When an activation is made resident, a distinguished initialization thread executes before any threads in the remote continuation vector. In the simplest case, the initialization thread for the first quantum of an activation will establish entry counts for the synchronizing threads and nullify the initialization thread for later quanta. Similarly, a distinguished completion thread is executed when the local continuation vector is otherwise empty. Again in the simplest case, this will extract a new enabled frame from the scheduling pool, make it resident, and transfer control to the initialization thread for the new quantum. This mechanism allows the compiler to control the use of processor registers. Frequently referenced frame slots may be "cached" in registers, with the initialization thread set up to restore them from the frame. Values judged likely, but not proven, to have a lifetime of a single quantum may be kept in registers, with the boundary threads configured to save and restore them. (Our compiler does not yet exploit this capability, but we expect this quasi-dynamic scheduling will prove quite important in the long run.)
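The role of the boundary threads can be sketched as a scheduling loop. This is an illustration in Python under several assumptions that the model itself does not fix: the scheduling pool is a FIFO queue, a boolean flag stands in for nullifying the initialization thread after the first quantum, and all names (Frame, post, schedule) are hypothetical.

```python
from collections import deque


class Frame:
    def __init__(self, name):
        self.name = name
        self.rcv = []               # remote continuation vector
        self.first_quantum = True   # assumed flag: init thread not yet nullified
        self.slots = {}


def post(pool, frame, thread):
    """Post a continuation to a non-resident frame, enabling the frame if needed."""
    if not frame.rcv:
        pool.append(frame)          # the frame enters the pool only once
    frame.rcv.append(thread)


def schedule(pool, log):
    while pool:                     # completion thread: take the next enabled frame
        frame = pool.popleft()
        if frame.first_quantum:     # initialization thread, first quantum only
            frame.slots["count"] = 0
            frame.first_quantum = False
            log.append(f"{frame.name}: init")
        lcv, frame.rcv = frame.rcv, []
        while lcv:                  # run until no enabled threads remain
            lcv.pop()(frame, pool, log)


def bump(frame, pool, log):
    frame.slots["count"] += 1
    log.append(f"{frame.name}: count={frame.slots['count']}")


def demo():
    pool, log = deque(), []
    f, g = Frame("f"), Frame("g")
    post(pool, f, bump)
    post(pool, g, bump)
    post(pool, f, bump)             # accumulates on f, which is already in the pool
    schedule(pool, log)
    return log
```

In this run both continuations posted to f execute in the same quantum, even though one for g was posted between them, which is the grouping effect the quantum is meant to capture.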
The representation of the scheduling pool is not specified by the model, but determined by the compiler. Sufficient space is provided in each frame to represent whatever data structure the compiler uses to organize the pool, e.g., queue, distributed queue, priority tree, etc. The compiler may even elect to specialize the structure to reflect different scheduling policies in different portions of the program.
2.5 Inlets and split-phase operations
Thus far, we have focused on the interactions of threads internal to an activation. We now describe TAM facilities for inter-frame interactions, i.e., passing arguments to an activation, returning results, allocating a remote resource (a frame or a structure), or accessing a remote structure.
In each of these cases, the external entity needs to transmit one or more values back to the activation, cause them to be deposited in frame slots and cause a thread to be scheduled to indicate the arrival of the values. Handling of such a response is usually viewed as a machine primitive.
For example, in P-RISC [21] an I-structure fetch instruction issues a request to the hardware module containing the location to be read, passing the frame pointer, slot number, and thread pointer. If the location is empty, the request is stored in the memory module until the location is written. When the location is full, the value it contains is sent back to the requesting processor, along with the frame, thread, and slot information. The hardware is expected to interpret this message, store the value in the specified slot and schedule the specified thread. On Monsoon [22], only a frame pointer and instruction pointer need be carried with the request, since the slot number is specified by the instruction. However, this relies on presence-bits associated with each frame slot.
TAM attempts to minimize the amount of information carried on such messages and to minimize the interpretation required upon their reception. To this end, each code-block has a set of inlets, in addition to a collection of threads. The inlets define its external interface. By convention, the low numbered inlets are used to receive argument values. The compiler generates an additional inlet for each value returned to the activation from a subordinate activation and for each reply to a split-phase request. Thus, the I-structure fetch instruction sends the frame pointer and inlet number (in addition to the address to be read) to the remote memory module. Each inlet is a simple sequence of instructions that extract components of the corresponding message, store values into slots of the specified frame, and post threads in the corresponding continuation vector. In other words, an inlet is a specialized message handler for exactly one kind of message. It is like a thread, but extremely limited in its capability and may interrupt the currently active thread. An inlet can, however, handle messages of arbitrary length, network interface permitting.
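As a rough illustration of inlets as compiler-generated message handlers, the Python sketch below dispatches on an inlet number carried in the message. The code-block, its two inlets, and the slot names are all hypothetical. The point it tries to show is that the message names only a frame and an inlet; the slot to fill and the thread to post are fixed in the inlet code itself.

```python
class Frame:
    def __init__(self):
        self.slots = {}
        self.rcv = []          # continuation vector that inlets post into


def thread_add(frame):
    return frame.slots["arg0"] + frame.slots["elem"]


def inlet_0(frame, payload):   # low-numbered inlet: receives an argument
    (frame.slots["arg0"],) = payload


def inlet_1(frame, payload):   # reply to a split-phase fetch
    (frame.slots["elem"],) = payload
    frame.rcv.append(thread_add)   # post the thread that consumes the value


# Assumed layout: one table of inlets per code-block, indexed by inlet number.
INLETS = [inlet_0, inlet_1]


def deliver(frame, inlet_number, payload):
    """A message carries a frame pointer, an inlet number, and data only."""
    INLETS[inlet_number](frame, payload)


def demo():
    frame = Frame()
    deliver(frame, 0, (10,))   # argument arrives
    deliver(frame, 1, (32,))   # fetch reply arrives
    return [thread(frame) for thread in frame.rcv]
```

Each handler does no more than unpack, store, and post, which matches the limited capability the text ascribes to inlets; the real work happens later in the posted thread.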
2.6 Comparison with other models
The basic structure of TAM is similar to several of the multithreaded architectures derived from dynamic dataflow, with some key differences. Iannucci's hybrid architecture [16] has a similar storage hierarchy; instructions may refer to processor registers or to slots in the current frame. However, the hybrid proposal associates presence bits with frame slots, and when a thread attempts to read an empty slot it is suspended; the continuation for the thread is placed in the empty slot and rescheduled when the slot is written. Registers vanish at a point of potential suspension. To allow multiple references to frame slots, the hardware must support lists of suspended continuations. Monsoon [22] associates presence bits with frame slots, but when a thread attempts to read an empty slot, the value carried on the token is written into the slot and further processing of the instruction is cancelled. The data is picked up again when the instruction is re-enabled by another token, at which point the instruction executes and enables one or two further instructions. The addressing capabilities of Monsoon instructions are rather limited: only one frame slot and the data value carried on the current token can be referenced. P-RISC, which strongly influenced TAM, uses frame slots for synchronization, rather than presence bits, and makes the synchronization operation explicit. However, P-RISC does not recognize a storage hierarchy (there are no registers) or scheduling hierarchy (the next thread may come from any frame).
Recent dataflow machines [12, 22, 25] allow several instructions, i.e., a thread, to be enabled by a single dataflow synchronization.
Registers are used to hold values with a lifetime of a single such thread, as these values need never be placed in the frame. However, register values are not carried across threads, which severely restricts the usefulness of registers, as shown in Section 3. Also, the set of enabled threads is maintained in a special hardware token queue.
Several multithreaded architectures have been proposed as generalizations of conventional single-threaded machines, with register sets (i.e., frames) multiplexed to hide memory and communication latency [1, 14, 17, 27, 29]. In most cases, only one thread of execution per frame is supported. Thus, each outstanding reference has an entire register set standing idle behind it. With the exception of MASA [14], the number of frames per processor is static; thus the mechanism does not directly support language models with dynamically generated parallelism. By viewing memory as split-phase transactions, TAM allows multiple outstanding references per register set and minimizes the number of register set switches.
Although TAM does not rely on multiple register sets in hardware, it could directly benefit from them.
Summary
To summarize the abstract machine, the program is represented by a collection of code-blocks. The state of the computation is described by the activation tree, the heap of structures, and the extended "processor state" of the resident activations.
The scheduling pool consists of a distributed data structure containing those frames with non-empty continuation vectors. Processors execute threads from their resident activations as long as possible. The abstract machine provides four levels of synchronization of increasing cost and decreasing frequency: simple sequencing of instructions within a thread, scheduling of threads generated internal to an activation, scheduling of threads generated external to an activation, and scheduling of quanta. The first two are represented directly in TL0 instructions; the latter two are synthesized in inlets and boundary threads. When an activation becomes resident, as much work as possible is done on the activation, using the least expensive form of synchronization wherever possible. Thread-to-thread and activation-to-activation transitions are explicit in the model, so they can be controlled to a large extent by the compiler. The compiler produces specialized message handlers as inlets to each code-block.
Appendix A describes TL0, a simple instruction set that embodies this model and provides a basis for comparison with other execution models.
Preliminary Measurements
To validate the TAM paradigm, a prototype back-end was developed for the MIT Id88 compiler that generates TL0 code, rather than dataflow graphs. A second, rather simple compiler translates the TL0 code to C, and a conventional C compiler is used for final machine code generation. This approach provides conservative execution times for a TAM-based implementation of a language requiring fine-grained dynamic synchronization. Timing measurements for a collection of small programs indicate that this implementation of Id is roughly competitive with conventional languages, i.e., between C and Lisp on a single processor for similar programs. Further improvements will be realized in a more sophisticated version of the compiler, currently under development, by producing machine code directly and by more sophisticated generation of TL0. High quality compilation for conventional machines provides a baseline for assessing the performance of novel architectures with dynamic instruction scheduling.
In expanding the TL0 code to C, instrumentation code is inserted to collect TAM-level statistics.
In this section, dynamic instruction mix measurements are compared with measurements of the MIT Tagged-Token Dataflow Architecture obtained with the Id World instruction-level emulator. This shows that TL0 instructions are essentially at the same level as TTDA instructions and demonstrates the benefits of compiling graphs into threads. Finally, dynamic TL0 instruction and scheduling data on large Id program runs is presented and used to derive the net cost of synchronization under this approach and with direct hardware support.
3.1 Performance of Id on stock hardware
Table 1 shows elapsed user time for several small programs written in Id, Lisp, and C and executed on a MIPS R3000. The Id programs are compiled to TL0, which is translated to C and compiled with optimization level "-O2". No inter-thread register usage is exploited. The Lisp programs are compiled using Allegro CL 3.1.0 (speed 3, safety 0). User time reported for the Lisp program is partitioned into program execution time and garbage collection time (separated by "+" in the table). The C programs are compiled with MIPS CC version 2.11 with optimization level "-O3". This data indicates that the net gain anticipated from hardware support for dynamic synchronization is considerably less than previously believed. Stock implementations need not execute at "interpreter speeds" if the compiler structures the representation of the program to minimize the cost of the most frequent forms of synchronization.
The data in Table 1 should be considered in light of the different implementation requirements of the languages. Id is a strongly typed, polymorphic language; thus no data tagging is required at run-time. However, the current version of the type system does not distinguish between integers and reals, so in the Id implementation all numbers are represented as 64-bit IEEE floats. Id version 90.0 remedies this situation. The TL0 code is translated into very unusual C, which defeats the optimization and register allocation heuristics of the C compiler. Direct translation to the host instruction set could yield considerable improvement. Common Lisp is dynamically typed and the implementation represents values as 32-bit tagged quantities, thus floats are accessed indirectly.
In C, types are explicit; neither dynamic typing nor dynamic synchronization is required. Data structures in Id are non-strict, i.e., the structure can be accessed before all its elements are defined, and require element-by-element synchronization. The current TL0 implementation uses an array of tag bytes to represent the status of an array of values.
MMT is a simple matrix operation test; it creates two double precision identity matrices, multiplies them, and subtracts a third identity matrix.
Performance of the Id version is within a factor of five of the C program and substantially faster than the Lisp program. The C compiler comprehends the simple inner loop and unrolls it four-fold. These optimizations are not attempted on the C generated from TL0. The current version of the Id compiler generates many unnecessary moves that are easily eliminated.
To simulate the compiler improvement and the impact of the more precise type system, we improved the inner loop by hand, bringing the time down to 57 seconds. Id is faster than Lisp only when GC time is included. The poorer performance of the Id version is due, in part, to our conservative thread partitioning in conditionals and the high frequency of calls, which currently involve a request to the UNIX malloc for activation frame storage.
The final two columns show user time on an array selection sort. The function used to compute the key is passed as an argument to the sort routine. The Id language system is designed to deal with higher order functions and is able to perform optimizations that would be difficult in the other languages. For the final column, we optimized these programs by hand; the Id version improves only slightly, while the others improve considerably.
Threads versus dataflow graphs
To demonstrate the impact of compiling to threads, Table 2 presents dynamic instruction frequencies for Id programs compiled to TL0 and compiled to graphs for the MIT TTDA [5].
In addition to the small programs discussed above, this includes two larger programs. Gamteb is a Monte Carlo neutron transport code. It is highly recursive with many conditionals. Simple is a hydrodynamics and heat conduction code widely used as an application benchmark, rewritten in Id [3]. The TTDA numbers were obtained using the IdWorld graph interpreter, with the same suite of arithmetic operators, structure operations, and the same resource management operations as in TL0. In the instruction counts, the STOP Table 2 show the instruction counts broken down into classes as a percentage of the number of TTDA instructions, i.e., the TL0 entries are normalized to the TTDA counts. The middle section represents the essential computation; the counts should be roughly equal under the two execution models. (The difference in arithmetic counts on MMT is due to operations that convert the index used in structure operations to an integer, which is not required under the tagged execution model of the TTDA.) The lower section includes the control and data movement "overhead" operations, which differ in the two models. In the TAM control-flow paradigm, it is necessary to move values to locations where they will be accessed, for exam
Scheduling hierarchy
To demonstrate the impact of the TAM scheduling hierarchy,
5 On the other hand, the ability to separate the fork from the stop provides a generalization of delayed branch and may be advantageous to support directly.
6 The read-modify-write process associated with the match may also stretch the machine cycle, so a small multiplicative factor should be employed to account for this and an average miss penalty should be included. Still, the basic argument is unchanged.
