The multithreaded processor - The access of remote data and the synchronization of threads cause processor idle times. It is the object of our architecture to fill these idle times by switching extremely fast to another thread of control. We further implement synchronization primitives that prevent busy waiting. [...] processor should be able to bridge memory latencies and synchronization waiting times so efficiently that it could also be applied in [...] switch buffer is used. The load/store unit shows up as the principal bottleneck. We evaluate four implementation alternatives of the load/store unit to increase processor performance.
Introduction
Currently, standard or application-specific microprocessors are used as nodes for multiprocessor systems. Standard microprocessors are developed and optimized for microcomputers or workstations with a single processor or with a low number of processors tied to a common bus. The use of standard microprocessors limits the scalability of shared memory multiprocessor systems unless provisions are made to bridge latencies caused by remote memory accesses or by synchronization operations. Because of the small market segment of multiprocessor systems, designing microprocessors specifically for use in multiprocessors is expensive.
Our research project aims at the development of a processor which is suitable for a node in a distributed shared memory (DSM) system as well as in a uniprocessor system. The storage of a DSM system is physically distributed, but all processors share a common address space. As a consequence, memory access time depends on the location of the accessed data. The data can be in the processor cache, the local memory, or the remote memory.
[...] on every instruction, the block multithreaded processors Sparcle of the MIT Alewife machine [4], MSparc [5] and MTA [6], the multithreaded superscalar processors developed at the Media Research Laboratory of Matsushita Electric Industrial Co. [7], at the University of California, Irvine [8], at the University of Karlsruhe [9], and the simultaneous multithreaded processor of the University of Washington [10], and the decoupled access/execute architecture DAE [11], which splits instruction processing of a single thread of control into memory access and execution tasks, executed by different units, that communicate via "architectural queues".
Our approach is most similar to the Sparcle and MSparc processors, which switch the context on a cache miss. However, the execution unit of our processor switches the context whenever it comes across a load, store or synchronization instruction, and the load/store unit switches whenever it meets an execution or synchronization instruction. In contrast to Sparcle, the context switch is triggered by the decode unit in an early stage of the pipeline, thus decreasing context switching time. On the other hand, the overall performance of our processor may suffer from the higher rate of context switches unless the context switch time is very small. Implementation alternatives for a very fast context switch are presented.
The Processor Architecture
The main idea is to remove all operations that may cause active waiting from the execution unit. Therefore, load, store and synchronization operations are performed by different units within the processor. We distinguish idle times caused by memory accesses from idle times caused by synchronization operations. The former depend on the memory hierarchy of a DSM system and are predictable within a time period that varies with network access conflicts. The latter depend on the program execution and are not predictable. We assign one unit to the load and store operations - the load/store unit - and another unit to the synchronization operations - the sync unit. The execution unit processes the arithmetic/logic and the control instructions. Each of these units executes instructions from a different thread. The units are coupled by FIFO buffers and access different register sets. The microarchitecture of the multithreaded processor is shown in figure 1. A unique thread tag identifies the thread. An activation frame is assigned to each thread, holding thread-local data, e.g. the program counter, the thread tag, and other state information. The activation frames are physically distributed to the register sets. If more activation frames exist than register sets are available, activation frames of blocked threads are stored in memory.
Each unit stops the execution of a thread when its decode stage recognizes an instruction intended for another unit. To perform a context switch the unit passes the thread tag to the FIFO buffer of the unit that is appropriate for the execution of the instruction. Then the unit resumes processing with another thread of its own FIFO buffer. The units execute different threads of control. Therefore, they access different activation frames and thus different register sets. A fast context switch is realized by simply switching to another register set. A more detailed microarchitecture description of the multithreaded processor is given in [12] . In the following, we omit the sync unit and concentrate on the load/store and execution units.
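The hand-over of thread tags between units can be illustrated with a minimal Python sketch. It is not the Rhamma implementation: the instruction classes `EXEC` and `MEM`, the `Unit` class and the `run` helper are hypothetical names chosen for illustration, timing is ignored, the sync unit is omitted, and the two units are stepped one after the other rather than in parallel.

```python
from collections import deque

# Hypothetical instruction classes; the real Rhamma encoding is not modelled.
EXEC, MEM = "exec", "mem"


class Unit:
    """One functional unit with its own FIFO buffer of thread tags."""

    def __init__(self, kind):
        self.kind = kind        # instruction class this unit executes
        self.fifo = deque()     # thread tags waiting for this unit
        self.executed = []      # (thread tag, pc) pairs, kept for inspection

    def step(self, threads, units):
        """Run the thread at the head of the FIFO until it needs another unit."""
        if not self.fifo:
            return
        tag = self.fifo.popleft()
        frame = threads[tag]    # stand-in for the activation frame
        program = frame["program"]
        while frame["pc"] < len(program):
            kind = program[frame["pc"]]
            if kind != self.kind:
                # Decode found an instruction for another unit:
                # pass the thread tag on and switch to the next thread.
                units[kind].fifo.append(tag)
                return
            self.executed.append((tag, frame["pc"]))
            frame["pc"] += 1
        # The thread has finished, so its tag is not passed on.


def run(threads):
    """Step both units in turn until no thread tag is left in any FIFO."""
    units = {EXEC: Unit(EXEC), MEM: Unit(MEM)}
    for tag, frame in threads.items():
        if frame["program"]:
            units[frame["program"][0]].fifo.append(tag)
    while any(u.fifo for u in units.values()):
        for u in units.values():
            u.step(threads, units)
    return units
```

In this sketch a context switch is nothing more than taking the next tag from the unit's own FIFO, which mirrors the idea that switching only selects another register set.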
Fast Context Switch
In general, with a five-stage processor pipeline (e.g. instruction fetch, decode, operand fetch, execution, write back), a context switch is recognized in the decode stage. This decoding does no useful work for the unit and costs one cycle. Accessing the new thread tag and loading the new instruction pointer from the thread tag each take one additional cycle. The first instruction of the new thread is decoded after two further cycles, so the context switching overhead adds up to 5 cycles.
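The cycle count above can be tallied as follows. The symbol $t_{\text{switch}}$ is introduced here only for illustration; the four terms are the contributions named in the text.

```latex
\begin{align*}
t_{\text{switch}}
  &= \underbrace{1}_{\text{decode of the triggering instruction}}
   + \underbrace{1}_{\text{access to the new thread tag}}
   + \underbrace{1}_{\text{load of the instruction pointer}}
   + \underbrace{2}_{\text{until the first decode of the new thread}} \\
  &= 5 \text{ cycles}
\end{align*}
```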
Besides the software simulations presented in this paper, we implemented the Rhamma architecture in VHDL and optimized the hardware towards a fast context switch. We obtained a context switch cost of at most one processor cycle by applying two optimizations:
The first technique is to code the context switch explicitly in the first opcode bit of the instruction. A complete decoding is then not necessary to recognize a context switch. The instruction fetch stage already recognizes the context switch itself, and the context switch costs only the cycle needed to fetch the instruction.
The second technique applies a context switch buffer, which is similar to the branch buffers in modern microprocessors. The context switch buffer is a small table in the execution unit, which holds the addresses of the most recently used load/store instructions. If the address of the next instruction to be fetched matches an address in the context switch buffer, a context switch is performed immediately. In this case the context switching time is reduced to zero. Otherwise the first method is used. Our simulations with real workloads have shown that only a small buffer with about 32 entries is required.
The context switch buffer is also suitable if the instruction fetch costs more than one processor cycle, as is usual in modern processors.
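The behaviour of such a buffer as seen from the execution unit can be sketched in a few lines of Python. The class name `ContextSwitchBuffer`, the method `switch_cost` and the least-recently-used replacement are assumptions made for this illustration; the text only states that the table holds the most recently used load/store instruction addresses and that about 32 entries suffice.

```python
from collections import OrderedDict


class ContextSwitchBuffer:
    """Small table of addresses of recently seen load/store instructions."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = OrderedDict()  # address -> None, ordered by recency

    def switch_cost(self, address, is_load_store):
        """Return the assumed switch cost in cycles, or None if no switch occurs."""
        if address in self.table:
            self.table.move_to_end(address)  # refresh recency on a hit
            return 0                         # hit: switch immediately
        if not is_load_store:
            return None                      # the execution unit keeps the thread
        # Miss: fall back to the opcode-bit method (one fetch cycle)
        # and remember the address for the next time it is fetched.
        self.table[address] = None
        if len(self.table) > self.entries:
            self.table.popitem(last=False)   # evict the least recently used entry
        return 1
```

A load/store instruction inside a loop therefore pays one cycle on its first occurrence and none afterwards, as long as its address stays in the table.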
[...] is the unit executing load [...] multithreaded processor the load/store unit is more essential than in a conventional [...] however, allows new possibilities to [...] load/store bottleneck. We studied four implementation [...]
* Stalling: The simplest implementation is to issue a load or store request to the memory interface and then wait for the load/store acknowledgement [...] completion of the memory access [...] instruction is scheduled.
[...] higher throughput of data.
* Interleaving: the load/store [...] each load or [...] switches the thread of [...]. A load or store instruction of another thread can be scheduled. The succeeding instructions of the switched thread are executed after receiving the acknowledgement corresponding to the memory request.
* Overlapping: One or several load/store requests are sent to the memory. Then the thread tag is handed over to the execution unit or synchronization unit, respectively. The next execution instructions are executed if the instructions are data independent from the [...] instruction, the unit switches the thread of control.
Depending upon the memory hierarchy [...] access times, cycle times and hit rates of memory and remote memory within a DSM system. We assume split transactions on the network of the DSM system. Therefore, the remote memory cycle time is chosen as a fraction of the remote memory access time. Access and cycle times are shown in table 1. We vary the cache hit rate and the local memory access rate. As explained in section III, the context switch of our VHDL implementation of Rhamma costs at most a single cycle and is reduced to zero if a context switch buffer entry matches. For the software simulations we used a context switch cost of one simulation time step.
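The role of the distinction between access time and cycle time can be illustrated with a toy timing model of the stalling and interleaving load/store units. This is not the simulator used for the results in this paper: the function names and all parameter values are hypothetical, execution instructions are ignored, and every request is assumed to see the same access and cycle time.

```python
def stalling_time(requests_per_thread, threads, access_time):
    """Stalling unit: each request occupies the unit for the full access time."""
    return requests_per_thread * threads * access_time


def interleaving_time(requests_per_thread, threads, access_time, cycle_time):
    """Interleaving unit: after issuing a request the unit switches threads.

    A new request may be issued every `cycle_time` steps (split transactions),
    but a thread may issue its next request only after the previous one
    has been acknowledged, `access_time` steps after it was issued.
    """
    ready = [0] * threads                    # earliest issue time per thread
    left = [requests_per_thread] * threads   # outstanding requests per thread
    unit_free = 0                            # time at which the unit may issue again
    finish = 0
    while any(left):
        # Choose the thread with pending requests that becomes ready first.
        t = min((i for i in range(threads) if left[i]), key=lambda i: ready[i])
        issue = max(unit_free, ready[t])
        unit_free = issue + cycle_time
        ready[t] = issue + access_time
        left[t] -= 1
        finish = max(finish, ready[t])
    return finish
```

With a single thread both functions return the same value, because there is no other thread whose request could fill the waiting time. With several threads the interleaving unit benefits from every step by which the cycle time is shorter than the access time.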
As simulation workload we applied several small application programs written in Modula-2. The applications were compiled to the machine language of DLX and to the extended machine language of Rhamma. For the simulations presented in this paper we chose a [...] benchmark programs. The workload is [...] 100000 instructions, three threads, and a ratio of one load/store instruction to three execution instructions. The number of data-independent succeeding instructions is two. This simulation workload does not contain synchronization instructions.
Simulation Results
Various configurations of multiprocessors were simulated. The four diagrams in figure 2 vary the cache hit rate and the local memory access rate. The remote memory access rate results from these two rates. The vertical axis shows the number of simulation time steps needed for the executed benchmark program.
As can easily be seen, the stalling load/store unit performs worst, and the combined approach (bottom right) gives the best performance (note the different scales on the vertical axis). The interleaving and the overlapping techniques are intermediate and cannot be strictly ranked against each other. This is because of the convex and concave curvature of the surfaces formed by the tops of the small bars. If we analyze the edges of the surfaces in the four figures, the following configurations of multiprocessor systems are represented:
Removing the local memory from the nodes, we represent [...] cache-only DSM multiprocessor [...] processor with caches and remote memory (figure 4). The three more complex load/store unit [...] to access [...] bridge memory latencies. Here, the advantage of a [...] processor is over [...] multithreading techniques perform in the realistic cache hit rates from 10% to 60% as well as conventional processors with a hit rate of 80% or higher. Even [...] without cache is as good as a conventional processor with a cache hit rate of 65%. Thus, multithreading can replace expensive cache memories.
c. In the diagram in figure 5, cache memory is left out, thus representing a DSM multiprocessor without caches. We see from the diagram that the interleaving, overlapping and combined methods are very good solutions for the problem of latency hiding. Executing instructions succeeding the load/store instructions also hides a small part of the latency in the overlapped conventional approach. The Sparcle/MSparc approach (see Sparcle remote curve) stalls on a local cache miss. It cannot bridge the local memory access time. If Sparcle/MSparc is modified to switch on local and remote misses, it shows a performance similar to that of the combined approach.
d1. The configuration d1 (figure 6) with a fixed remote memory access rate of 30% varies the cache hit rate and the local memory access rate. All three multithreading approaches, interleaving, overlapping, and their combination, show good performance due to the latency bridging for DSM systems.
Increasing the number of nodes in a multiprocessor system corresponds in general to a lower hit rate to the local memory, because data is distributed over the local memories of more processors. The graphs for the interleaving, overlapping and combined methods are nearly horizontal, which shows that the use of multithreaded processors supports the scalability of the system. The data distribution is not a critical issue, so programs and compilers do not have to care about it. In all six diagrams the combined approach performs best; the less complex interleaving and overlapping approaches are often nearly as good as the combined approach. The stalling approach is too simple. It often performs better than the stalling conventional processor, but worse than the overlapping conventional processor. The load/store variants that use overlapping [...] the number of data [...] a load/store instruction [...] are provided, the wait [...] the execution unit [...] contrast, the overlapping [...] down if the number of data-independent instructions decreases. Changing the instruction mix will change the processor utilization. Best utilization will be reached [...] overlapping and interleaving, [...] execution, depend on [...] instructions following [...] instruction mix given by the equation
[Equation not recoverable from the source; surviving fragments: "load/store", "# instructions", "cycle time".]
Our multithreaded processor [...] conventional processor are based [...] RISC processor. It is not easy to compare the simulation results with [...] hypothetical superscalar processor. However, since a superscalar processor is also equipped with [...] unit, it is comparable with our multithreaded [...] containing an execution unit, which is able to issue execution instructions simultaneously from a single thread to several functional units. When simulating a higher issue bandwidth, the main problem remains the load/store bottleneck, which can only be widened by better cycle times.
[...] processor which uses fast [...] latencies caused by memory accesses or synchronization operations. Since the context switch is triggered by the decoding in an early stage of the pipeline, context switching time can be as short as one cycle. The multithreaded processor outperforms the conventional processor by its ability to tolerate memory latencies by executing instructions of another thread. Because of the short context switching time, a load of only few [...] sufficient for increasing performance over a conventional [...] depend on the access [...] time can be fully [...] time proves as the [...] Cycle times should be shorter than access times. The implementation of the load/store unit is essential for the overall performance, too. The Rhamma processor with a simple stalling load/store unit performs better than the stalling conventional processor but worse than the overlapping conventional processor because short cycle times cannot be utilized. The more sophisticated load/store unit implementations increase the performance of the multithreaded processor. The combined approach performs best. [...] the next step after the software simulation, we developed a VHDL implementation of Rhamma. We conducted a hardware simulation and synthesis using the Synopsys tools. With the software simulation results in mind, we chose to implement the combined load/store unit and to minimize the context switch overhead, which we could reduce to at most one processor cycle (see section III). We are working towards a hardware prototype.
