Multithreaded processors support a number of execution contexts and switch contexts rapidly in order to tolerate highly latent events such as external memory references. Existing multithreaded architectures are implicitly based on the assumption that latency tolerance requires massive parallelism, which must be found from diverse contexts. We have carried out a quantitative analysis of the efficiency of multithreaded execution as a function of the number of threads for two important classes of memory systems: conventional off-chip memory and symmetric networks. The results of these analyses show that there are fundamental reasons for the efficiency to grow very rapidly with the number of threads. This in turn implies that the original goal of latency tolerance can be achieved with only a limited number of threads, which can typically be drawn from the same referential context and therefore do not require the heavyweight hardware solutions of conventional multithreading. A novel dynamically scheduled RISC architecture is presented based on this new understanding of the problem.
Introduction
There are many constraints on computer architecture, technological limitations and instruction set compatibility being two of the tightest. In this paper we propose and justify some future directions in RISC-based processor design which provide solutions to a fundamental problem encountered in a wide range of computer systems: that of statically scheduling concurrent operations in order to avoid high-latency and non-deterministic events. Our solution to this problem, namely to remove the necessity of a static schedule, is not unique. However, the manner in which this is achieved and the impact that it has, due to the very small overhead required, are, we believe, quite novel.
Let us first consider why scheduling, whether dynamic or static, is such an issue in today's computer systems. Even in a sequential or uniprocessor system there is always concurrency to exploit. All processors these days are pipelined and many support multiple issue of instructions, either in a programmed manner (VLIW) or in an implicit manner (superscalar). Of course this also implies that compiler writers, even when compiling sequential code, must analyse dependencies and extract concurrency in order to exploit these features. Ideally a compiler writer should know the exact behaviour of every instruction executed in order to produce the perfect schedule, i.e. one which overlaps sufficient independent instructions to overcome high-latency operations, for example a memory fetch or floating point division. Unfortunately this is not the case; indeed technological considerations, such as the divergence between memory and processor performance (on- and off-chip performance), mean that this is unlikely ever to be the case again. Non-determinism in the execution of computer instructions occurs in many areas:
in cache-based memory systems, where a cache miss will mean the difference between waiting just a few cycles or many cycles for a memory reference to be satisfied; in branches in control, which must be predicted if instructions are to be prefetched and decoded; or in network-based shared-memory systems found in parallel computers, where accesses have a very high latency aggravated by a significant dispersion due to contention in the network.
Any misprediction in these areas will destroy a static schedule and may have severe consequences on overall performance. Dataflow research [12] was thought by many to provide the solution to the scheduling problem, but the solution it provides is far too general: not only does it support a dynamic schedule but also dynamic parallelism, which is not required in compiling most imperative programming languages. Moreover, the overheads are high and the resulting processor designs tend to require deep pipelines, which in turn make the solution inapplicable to the majority of installed codes, from which only modest levels of parallelism may be extracted. More recently, many other techniques have been proposed which enhance the conventional RISC-based approach to processor design. These solutions either attempt to predict the non-determinism in the areas of cache accesses and branching or try to mitigate the effects of misprediction. Branch prediction is a technique which has been used for some time in existing microprocessors but which is still being refined [9]. Cache prefetching and data streaming [1, 6, 10, 11, 15, 19] attempt to ensure that the required data is pre-loaded into either the cache memory or a dedicated stream buffer prior to a memory fetch being issued. Yet another approach is the lockup-free cache [18, 21]. Often, however, these techniques introduce further speculation, such as that involved with prefetching [8], which can, as has been demonstrated [16], have an even more detrimental effect on performance in the event of a misprediction.
It seems clear to us that computer architects are looking in the wrong direction down the arrow of time: instead of designing computers that try to prejudge a program's data accesses or branches, they should simply look at tolerating the latencies involved. This paper demonstrates how this may be achieved and shows that it need not require massive concurrency, as is often thought to be the case.
The alternative approach is multi-threading, which is far from being a recent development, as Burton Smith's pioneering work on the Denelcor HEP demonstrates [20]. It does, though, seem to have come of age [2] recently. However, the most up-to-date work on multi-threading [3, 13] still takes the view that a thread is a lightweight process complete with minimal context, such as stack and registers. Our own approach is that a thread is just a program counter. In both cases, non-deterministic events which are mispredicted, such as branches and loads, will cause a new thread (program counter) to be executed. In our view it seems strange that many contexts should be thought of as a good basis for non-deterministic thread interleaving, as this will play havoc with any locality that may exist within a single context, with consequent loss of performance due to cache misses and memory bandwidth limitations. Indeed the authors can find no justification for this approach whatsoever, and demonstrate in this paper that relatively few threads are required in order to obtain good performance from a multithreaded processor and that such a small number of threads may easily be derived from within a single context.
In order to differentiate this approach from current thinking we will refer to it as micro-threading. Of course, with the expectation that such threads will be rather small, possibly just a few instructions, it is imperative that the overheads for fork, join and synchronisation are extremely low. In section 2 we outline our approach to designing a modified RISC processor which supports dynamic scheduling through micro-threading. We will argue that for single-threaded code this approach need be no less efficient than the conventional single-threaded RISC processor it replaces. Moreover, we will demonstrate how very simple techniques enable us to introduce micro-threading with little or no overhead in terms of instructions executed. In section 3 we develop an analytical model for both conventional and network-based memory systems which confirms some previously simulated results on network-based systems [4]. This model demonstrates how few threads are required to sustain a high proportion of maximum throughput from such systems. This analysis justifies the approach we have described in section 2.
Our goal in this work has been to design micro-threading primitives that might be added to any modern RISC-based processor. The resulting design should be completely compatible with existing compiled codes, which should execute with no loss of efficiency. Moreover, this should require as few additions as possible to the instruction set of whatever RISC design it is based on. The techniques proposed will provide very rapid context switching of lightweight threads (micro-threads) which share a single set of registers and the stack. Our motivation is to support the efficient execution of data-parallel languages [5].
However, due to the economics of the microprocessor market, the processor must execute existing codes with unaffected performance and should provide enhanced performance even if "sequential" code is recompiled for it, using established techniques such as the exploitation of functional parallelism in expressions, independent statements, loop unfolding, etc.
Let us consider the generic pipeline, illustrated in Fig. 1, which is typical of many current RISC processors. It has 6 stages with bypass buses from the final three. The first issue which must be considered is how synchronisation may be achieved on the non-deterministic events whose latency must be tolerated. On RISC processors, which are based on the principle of decoupling memory accesses and processing by adopting a load/store philosophy, it seems an obvious extension to base any synchronisation on additional state associated with the register, effectively providing a split-phase asynchronous LOAD mechanism.
Split-phase asynchronous LOAD
An asynchronous LOAD instruction can be implemented by simply providing an additional status bit on each register, which indicates whether the register contains valid data or not. Considering our generic RISC pipeline (Fig. 1), a LOAD is executed as an instruction which normally writes a "data invalid" state into its destination register. Meanwhile, the data cache read begins at stage 3 and, in the case of a cache hit, the fetched data is multiplexed into the output of stage 4, overwriting the previously prepared "invalid" data. If the access misses the cache, then the "invalid" data state is written into the destination register. The register will be re-written later by the cache/memory subsystem, either by inserting a bubble into the pipeline or via a dedicated port to the register set. Thus any outstanding memory request must be tagged by the register number into which the data will be written.
We say that a memory access is split-phase if it misses the first-level cache. More precisely, a split-phase LOAD is one that cannot be satisfied during the life cycle of that LOAD in the pipeline.
Any instruction which reads an "invalid data" state from either of its operands stalls at stage 2. It can either wait for the data to be bypassed from the later stages or merely keep reading the register file. However, in either case, no further processing may take place until the requested data becomes available. Clearly this situation is undesirable, for although the compiler may be able to insert sufficient slack to tolerate the load, any dispersion of latency will destroy the schedule. Below we consider a mechanism for removing this restriction of a fixed schedule.
Dynamic micro-threaded scheduling
To perform true dynamic scheduling of several instruction streams we have to introduce the explicit notion of independent points of control, i.e. the manipulation of multiple program counters (PCs) by the processor. A PC represents the minimum possible context information that we can keep for a given thread, and in the architecture suggested it is the only reference to a thread. Since several threads can be active simultaneously, some explicit storage for their PCs, called the continuation queue, must be provided. This is associated with the instruction fetch logic at the entry of the pipeline (Fig. 2).
In a normal RISC pipe the next address is transferred from the first stage of the pipe in order to allow the next instruction to follow without delay. Of course, on branch instructions, this normally involves an element of speculation, as the direction taken must be predicted, and if this prediction fails any subsequent change of state must be "cleaned up". We will call this conventional mechanism of transferring control horizontal transfer, and the alternative mechanism that we propose here, which acts through the continuation queue, vertical transfer. Any instruction can transfer control vertically, horizontally, both vertically and horizontally, or not transfer control at all. Thus we already have a mechanism for the creation and termination of threads. Of course, whenever an instruction does not transfer control horizontally, the next PC (if any) is taken from the continuation queue. This mechanism, combined with a modified form of synchronisation described below, provides the basic mechanism for latency tolerance.
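The interplay of horizontal and vertical transfer can be illustrated with a toy fetch model. This is only a sketch under stated assumptions, not the paper's design: the names `Instr`, `run` and `PROGRAM` are hypothetical, the continuation queue is assumed to be first-in first-out, a fork is assumed to place the new thread's entry PC in the queue, and a purely vertical transfer is assumed to queue the thread's own next PC.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    name: str
    horizontal: bool = True         # continue with pc + 1 directly
    vertical: Optional[int] = None  # PC placed in the continuation queue, if any


def run(program, entry=0, max_steps=100):
    """Fetch loop: returns the order in which instructions are issued."""
    queue = deque()                 # continuation queue of ready PCs (assumed FIFO)
    trace = []
    pc = entry
    for _ in range(max_steps):
        if pc is None:              # no horizontal successor and empty queue
            break
        instr = program[pc]
        trace.append(instr.name)
        if instr.vertical is not None:
            queue.append(instr.vertical)
        if instr.horizontal:
            pc = pc + 1
        else:
            pc = queue.popleft() if queue else None
    return trace


PROGRAM = [
    Instr("a0", horizontal=True, vertical=3),   # both: fork a thread at PC 3
    Instr("a1", horizontal=False, vertical=2),  # vertical only: yield, resume at 2
    Instr("a2", horizontal=False),              # neither: thread terminates
    Instr("b0", horizontal=False, vertical=4),  # vertical only: yield, resume at 4
    Instr("b1", horizontal=False),              # neither: thread terminates
]

print(run(PROGRAM))
```

In this sketch the two threads interleave at every vertical transfer, and execution ends once an instruction transfers nowhere and the queue is empty.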
A thread is created when an instruction is encoded to transfer control both horizontally and vertically, and a thread is terminated when an instruction is encoded not to transfer control. There are two extremes in how this encoding can be provided. The first, if the instruction set allows it, is to use two spare bits in some or all instructions in order to encode the direction of transfer, thus allowing any instruction to create or terminate a thread. The other extreme is to add a pair of instructions to the instruction set in order to perform thread creation and termination explicitly, and then to provide a fixed encoding over the remaining instructions to determine whether the instruction transfers control horizontally or vertically. The former is usually preferable, but should the two bits required not be available, the latter, which decodes the transfer strategy from the instruction code, can be applied but is not always optimal.
The instruction which is responsible for a non-deterministic delay (e.g. a LOAD) is not necessarily the one that needs to be coded for vertical transfer; it may be beneficial to code its consumer (i.e. the instruction that reads the register loaded) so that the compiler may insert a sequence of statically scheduled instructions between the two, thus maximising the concurrency available while still minimising the probability of the consumer being put to sleep. In order to simplify the misprediction recovery, it makes sense to issue the vertical PC from a deeper pipeline stage at which the actual branch direction is already known. If precise arithmetic exceptions are not required, this can be done right after the register file read (at stage 3 in our example in Fig. 1). In this case it is no longer necessary to predict branch direction, provided that enough parallelism is available in the code.
Sleep-wakeup synchronization
Now that we have introduced a mechanism for dynamic scheduling, it is necessary to revisit the synchronisation mechanism, involving the split-phase load, described above. Any instruction which is dependent on a non-deterministic event, such as access to a register previously loaded from external memory or a conditional branch, will normally be encoded to transfer control vertically, pulling another thread into the pipe behind it. Now at the register file read stage (stage 3 in Fig. 1) the instruction either completes and transfers vertically to the continuation queue, where the thread waits to be scheduled again, or, if the data is not present or some event has not occurred, the PC itself is written into the register read and that thread is neatly put to sleep. Now all registers are required to be tagged with two status bits indicating three possible states:
1. The register contains invalid data. The register can be set into this state by a special instruction or a LOAD instruction which misses the primary cache.
2. The register contains a program counter. The register contains the PC of a sleeping thread.
3. The register contains valid data. Any other write of data sets the register into this state.
The way synchronisation is achieved is that when an instruction attempts to read an operand containing invalid data, the instruction is aborted and transformed into a "store program counter" instruction, which replaces the register's contents by the PC pointing to that instruction. Hence the thread is put to sleep and its PC is kept in the register to which data is expected to be written later. The arrival of that data causes the PC to be pushed out of the register and put into the continuation queue. When rescheduled, the same instruction will find that data in the register its PC has vacated.
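The three register states and the sleep and wakeup actions can be modelled in a few lines. The sketch below is illustrative only: `State`, `RegisterFile`, `invalidate`, `read` and `deliver` are hypothetical names, and the text does not say what happens when a second instruction reads a register that already holds a sleeping thread's PC, so the sketch simply raises an error in that case.

```python
from collections import deque
from enum import Enum


class State(Enum):
    INVALID = 0   # awaiting data, no thread waiting
    PC = 1        # holds the PC of a sleeping thread
    VALID = 2     # holds ordinary data


class RegisterFile:
    def __init__(self, size, continuation_queue):
        self.state = [State.VALID] * size
        self.value = [0] * size
        self.queue = continuation_queue

    def invalidate(self, r):
        """A LOAD that misses the cache marks its destination as invalid."""
        self.state[r] = State.INVALID

    def read(self, r, pc):
        """Operand read by the instruction at `pc`; None means the thread slept."""
        if self.state[r] is State.VALID:
            return self.value[r]
        if self.state[r] is State.INVALID:
            # "store program counter": the reader's PC replaces the contents
            self.state[r], self.value[r] = State.PC, pc
            return None
        raise RuntimeError("second waiter on one register: not covered by this sketch")

    def deliver(self, r, data):
        """Arrival of data: push out a sleeping PC, then store the data."""
        if self.state[r] is State.PC:
            self.queue.append(self.value[r])   # wakeup via the continuation queue
        self.state[r], self.value[r] = State.VALID, data


queue = deque()
regs = RegisterFile(4, queue)
regs.invalidate(2)                  # split-phase LOAD into r2 is outstanding
first_try = regs.read(2, pc=40)     # consumer at PC 40 is put to sleep
regs.deliver(2, 123)                # memory responds, thread is woken
second_try = regs.read(2, pc=40)    # rescheduled instruction now finds the data
```

Because the sleeping PC lives in the register that is awaiting data, no storage beyond the two status bits is needed to record which thread waits on which load.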
These two synchronization actions (sleep and wakeup) require mutually exclusive read-modify-write access to the register bank. This can be implemented using the dynamic dependency control, stall and bypass logic which is embedded in most RISC pipelines.
When executing non-threaded code, all instructions would transfer control horizontally, as there are no additional threads to be executed. This situation is very similar to that of a conventional pipe: a prediction is made (in the case of a LOAD dependency the prediction is that the data is in the primary cache) and the next instruction is executed. Again, if an operand contains invalid data a misprediction recovery is performed, i.e. the continuous chain of horizontally fetched instructions headed by the mispredicted instruction is cancelled. However, when there are no other ready threads available it is preferable to use the stall mechanism rather than the sleep-wakeup mechanism, to reduce the penalty involved.
In threaded code, if both operands of a dyadic operation are invalid, we can choose either of them as a target for the thread's PC. However, a fixed rule (e.g. always the first) can allow the compiler to perform additional optimisation.
The wakeup action is performed explicitly by means of a special "move synchronizing" instruction which is inserted into a slot created in the pipeline by stalling earlier stages. This instruction moves one register to another. However, it reads both the source and the destination registers and, if the destination contains a PC, that PC is issued vertically later in the pipeline. In fact, any instruction reading only one register (e.g. an operation with a literal) can be used for this purpose, provided it does not itself transfer control vertically.
The manner in which a split-phase memory request delivers the data using a "move synchronizing" instruction is illustrated in Fig. 3.
Analysis of overheads
We have shown in the sections above how fork, stop, wait and signal may be implemented in a conventional RISC pipeline using its register set as the synchronisation resource. In the introduction we argued that in order for micro-threading to be viable it must be possible to handle threads of a few instructions with no significant overhead. In this section we analyse the overheads involved in these various actions.
Thread creation
The overhead of a fork depends on detailed design at the instruction set level. Any instruction which has space for an additional address may be encoded to transfer both horizontally and vertically and hence yield a fork at zero cost. Alternatively a variant of a branch instruction may be encoded as a fork. There is also a trade-off in implementation, as it may be necessary (for example in a conditional fork) to transfer control vertically on both continuations, and without additional hardware support this could require two additional instructions. The overhead for a fork, therefore, is 0, 1 or 2 cycles, known statically, depending on instruction set encoding and hardware trade-off. The frequency of finding a double vertical transfer in compiled code will determine whether additional efforts should be made in order to keep the overhead bounded by a single cycle.
Thread termination
The overhead for a stop also depends on detailed design, but as no transfer of control is required the overhead is 0 or 1 cycle. Again this will depend on whether a special instruction is added or existing instructions are encoded for transfer of control (in this case the encoding is for "no transfer").
Sleep
This action is implicit in any instruction that reads a register. Storing the PC does not require an additional cycle, as the result of the current instruction is not written if its thread is suspended. However, the suspended instruction must be reissued once the dependency which put it to sleep is resolved. The overhead for a sleep is therefore 0 or 1 cycle, which is not known statically, for it will depend on whether the register contains "invalid data" or not. More importantly, the overhead for a valid prediction is 0 cycles.
Wakeup
A wakeup signal may be generated internally, or externally as is the case with a split-phase load. An internal signal may again require a separate instruction, although it can be encoded as an option in any instruction which reads no more than one register and does not transfer control vertically. This latter condition will often be satisfied, as signalling is typically performed as the last action of a thread, when it will transfer neither horizontally nor vertically. Thus for an internal signal the overhead is 0 or 1 cycle, known statically. Again there is a hardware trade-off in the case where no additional instruction is added, as by providing an additional register port the restriction of reading no more than one register may be removed, thus completely eliminating the overhead of internal signalling.
An external signal, such as a split-phase load, must create a slot in the pipeline in order to insert the "move synchronizing" instruction. Thus the overhead here is always 1 cycle.
Summary
It is clear from the above analysis that micro-threading may be achieved with little overhead. In the case where existing instructions are overloaded with the transfer method, the overheads of fork, stop, internal signal and successfully predicted wait can be eliminated entirely. In the case where additional instructions must be added to an existing instruction set, the overheads for all threading operations may be limited to a single additional cycle. The following section demonstrates the viability of finding sufficient threads within a single context to maintain a high percentage of peak sustainable performance.
On the required number of threads
The question we must ask ourselves now is how many threads are required to tolerate the latency found in typical memory systems, as this will determine whether micro-threading is a viable architectural model. In order to answer this question, let us simply study performance as a function of the number of threads, as the ultimate goal is maintaining a high percentage of maximum possible performance. In this analysis, we will see that, contrary to the common belief, latency of the memory system is not the only factor that influences this function. Latency has to be considered in combination with other factors including the program behaviour.
What then is the correct measure of performance? We are looking for a function $P(n)$ that depends on the program as a whole, on the number of threads $n$ in its multi-threaded representation (this is the explicit parameter of the function), and on the characteristics of the memory system.
We will introduce $P(n)$ in the framework of the simplified model presented in Fig. 4. The program tries to read its data from cache (writes are considered non-blocking and are therefore unrelated to the problem in question). We assume that a cache miss suspends the current thread and causes a request to the memory system. Although we do not specify the nature of the memory system at this stage, we presume that it has some throughput limitation and some latency. By observing the execution of a given program on such a system, we can determine the following parameters:
1. Average memory throughput $G$ (measured in requests per unit of time) granted to the program. This parameter depends both on the technical characteristics of the memory system and on the program's behaviour: a program that has a high degree of temporal locality and therefore receives most of its data from cache will cause low throughput on the memory channel.
2. Average latency $L(G)$ of the memory system. It, again, depends both on the memory system and on the program (indirectly, via $G$). For example, $L(G)$ tends to grow with $G$: the heavier the workload, the longer the delays caused by queueing.
3. The maximum sustained throughput $T_{max}$ of the memory system. It does not depend on the behaviour of the program, but restricts the average memory throughput granted to it: $G \le T_{max}$.
As $G$ and $L$ depend on the program's behaviour, we need an adequate measure of it. In order to establish this measure, let us return to the original problem of compensating for memory delays. The most natural way to assess the efficiency of such compensation for a given program is to see what would happen if there were no delays whatsoever.
Let us temporarily replace the actual memory system by an imaginary "instant" one that does not limit throughput and responds with data in zero time, and let $R$ be the rate at which the program issues memory requests in this situation. Of course, it may be argued that $R$ depends on a number of factors including locality, which may not be constant but vary with the number of threads in the program's representation. However, what we are interested in is the performance of a system for a given $R$ but with a varying number of threads. Micro-threading in any case will minimise the effects of data locality as a function of the number of threads.
In order to simplify the analysis we assume that the probability of a load instruction causing a cache miss is a constant which reflects the degree of temporal and, to an extent, spatial locality of the program as a whole. In particular, this probability does not depend on the number of threads in the multi-threaded representation of the program, nor on the manner in which those threads are scheduled for execution, nor on the nature of the memory system. Under this assumption, the statistics of memory requests can be approximated by the Poisson distribution: at each clock cycle of the processor a request is issued with some constant probability (if throughput is measured per clock cycle then this probability is $R$, though our analysis is invariant to the choice of time unit). The underlying logic is as follows. According to the RISC community folklore, on average one in 3 to 4 instructions is a load, which makes it a fairly frequent event. As no memory system is capable of handling such a workload, caching is essential and the average probability of a cache miss is much less than 1, which means that memory requests are, conversely, infrequent (in both cases the time scale is given by the clock cycle). Moreover, in a multithreaded processor the actual order in which the instructions are executed is randomised by the dynamic interleaving of threads: see Fig. 6. Therefore, the event of a memory request (i.e. a load instruction that causes a cache miss) occurring at any given clock cycle is of predominantly probabilistic nature, and treating the statistics of memory requests as Poissonian is justified.
We can now define the performance function as processor utilisation that varies between 0 and 1:
$$P(n) = \frac{G(n)}{R}. \qquad (1)$$
The average proportion of memory requests per executed instruction is the same with or without delays due to our assumption that the probability of a cache miss is constant. Therefore, the ratio of the two throughput values equals the ratio of the corresponding numbers of executed instructions per unit of time. As $R$ is the memory request rate in the situation when there are no delays and therefore the processor is utilised completely, $P(n)$ as defined above is the degree of the processor utilisation achieved by the program.
Note that if $R > T_{max}$ then the processor utilisation cannot exceed $T_{max}/R$, as $G(n) \le T_{max}$ for any number of threads. This is what should be expected: $R > T_{max}$ means that the program consistently issues more memory requests than the memory system can handle; in such a situation, the processor is bound to remain idle for some proportion of its time.
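As a quick illustration of this bound, take purely hypothetical figures of $R = 0.1$ and $T_{max} = 0.05$ requests per cycle, and read the performance function as the ratio $G(n)/R$ described in the text above:

```latex
P(n) \;=\; \frac{G(n)}{R} \;\le\; \frac{T_{max}}{R} \;=\; \frac{0.05}{0.1} \;=\; 0.5
\qquad \text{for every } n .
```

With these assumed numbers no amount of multithreading can lift utilisation above one half, because the limit comes from memory throughput rather than from latency.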
We have reduced the problem to finding the function $G(n)$ which, of course, depends on the program behaviour ($R$) and on the characteristics of the memory system; these characteristics, in turn, may be dependent on $G$. Now we will obtain an equation that links all the relevant parameters.
Our general model of the memory system shown in Fig. 4 may be treated as a black box that may contain unfinished transactions. The average number of transactions being simultaneously processed by the memory system is given by the product $G \cdot L(G)$. In the following analysis, we will ignore the discrete nature of the processor, i.e. the fact that the time difference between two consecutive transactions coming into the black box cannot be less than one processor cycle.
Each unfinished transaction represents a sleeping thread. We assume here that a thread is ready for execution unless it has placed a request into the memory system. Let $S$ be the probability that a particular thread is sleeping. The average number of sleeping threads in a program of $n$ threads is then $nS$, so that
$$n S = G \cdot L(G). \qquad (2)$$
On the other hand, the probability of the processor being idle because all threads are sleeping is equal to the normalised performance loss:
$$S^n = 1 - P(n) = 1 - \frac{G}{R}. \qquad (3)$$
From (2) and (3) we obtain the fundamental equation of statistical balance:
$$\left( \frac{G \cdot L(G)}{n} \right)^{n} = 1 - \frac{G}{R}. \qquad (4)$$
This equation for the function $G(n)$ has, of course, to be solved numerically. But before this can be done, we have to know how the latency $L$ of the memory system depends on the granted throughput $G$. In the following section we will address this question for two important types of memory systems: conventional memory and network.
Latency as a function of granted throughput

Conventional memory
In the conventional memory system (see Fig. 7), the latency $L_0$ of the memory itself does not depend on the request rate. However, as the throughput of such memory is limited by the constant $T_0$, requests have to be queued and the total latency is the sum of the queueing time and latency proper:
$$L(G) = L_q(G) + L_0. \qquad (5)$$
We assume here that the memory cycle begins immediately after the first request has arrived at the queue and that the memory takes a request out of the queue for service. The average queue length is therefore equal to the number of waiting customers in a system with deterministic service time.¹
The maximum sustained throughput is in this case defined by the hardware characteristics of memory: $T_{max} = T_0$.
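To see how quickly utilisation can grow with the number of threads, the balance equation can be solved by bisection. The sketch below rests on several assumptions that are ours rather than the paper's: the balance equation is taken in the form $(G \cdot L(G)/n)^n = 1 - G/R$, the queueing term $L_q(G)$ is approximated by the standard M/D/1 mean waiting time $\rho / (2 T_0 (1 - \rho))$ with $\rho = G/T_0$, and the parameter values are made up. The function names are hypothetical.

```python
def latency(G, L0, T0):
    """L(G) = L_q(G) + L_0, with L_q assumed to be the M/D/1 mean waiting time."""
    rho = G / T0                       # memory utilisation, must stay below 1
    return L0 + rho / (2.0 * T0 * (1.0 - rho))


def granted_throughput(n, R, L0, T0, iters=100):
    """Bisection for G(n) on (0, min(R, T0)).

    The left side of the assumed balance equation grows with G while the
    right side falls, so the root is unique.
    """
    lo, hi = 0.0, min(R, T0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid >= T0:                  # guard against rho == 1 from rounding
            hi = mid
            continue
        S = mid * latency(mid, L0, T0) / n   # probability that a thread sleeps
        if S >= 1.0 or S ** n > 1.0 - mid / R:
            hi = mid                   # left side too large: G must be smaller
        else:
            lo = mid
    return 0.5 * (lo + hi)


def utilisation(n, R, L0, T0):
    """Processor utilisation P(n) = G(n) / R."""
    return granted_throughput(n, R, L0, T0) / R


# Hypothetical parameters: requests per cycle and cycles.
R, L0, T0 = 0.05, 20.0, 0.1
for n in (1, 2, 4, 8):
    print(n, round(utilisation(n, R, L0, T0), 3))
```

With these assumed parameters the utilisation rises from roughly one half with a single thread to well above 0.9 with four, which is the qualitative behaviour the analysis argues for. A different queueing model would change the numbers but not the structure of the solver.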
Network
Network model
When analysing the network, we will concentrate only on the fundamental features of the network structure itself, not taking into account such factors as contemporary technology constrains etc. In our model, the network is synchronous and messages make one hop in one network cycle. We assume that the network is direct and each node is connected to each of the K adjacent nodes by a pair of channels: input and output. Routers have t w o queues for each output channel: one for transit messages and another for messages injected by the processor connected to the router. The transit queue is granted the higher priority, that is the internal queue is served only when the transit queue is empty in fact, relative priorities of transit and locally-injected messages make v ery little di erence for the nal result, but the above convention simpli es the calculations. The network is presumed to be loaded by random uniform tra c, and the routing strategy comprises a random choice between the channels pre xing any shortest path to the target.
Let D be the doubled average distance between two randomly chosen nodes in the network for symmetrical networks it is equal to the diameter. Each transaction is assumed to be of request-reply type remote memory LOAD or STORE with con rmation. Thus, each processor e ectively sends a message to itself via random route; the average message hop count is therefore equal to D. W e ignore the request processing time at the target node assuming that it is done by independent hardware. So, from the tra c point of view, the request passes the target node as a part of transit tra c. In order to understand basic properties of the network acting as a memory subsystem, let us nd an approximation of the number of requests from a given node that can be simultaneously active in the network.
Since any message makes on average $D - 1$ transit hops before being consumed at the final hop, only a $1/D$ part of the incoming messages is consumed by any given node while a $(D - 1)/D$ part is forwarded.² Due to the statistical balance between the incoming and outgoing transit flows, the proportion of the node's "own" messages in the outgoing traffic is also $1/D$: see Fig. 8.
In the following analysis, we take the network cycle as the unit of time. Let $r \le 1$ be the observed average traffic rate in the network; due to the random uniform nature of the traffic, this average figure is the same for all communication channels. According to the proportion between transit and injected/consumed traffic flows discussed above, the rate of transit traffic is $r(D - 1)/D$. For a given processor, each output channel of the respective router can be considered a black box with an input rate of $r/D$ and some latency which is proportional to the average number of hops $D$ (at this stage we are ignoring initial queueing of requests at the originating node). So, as a first-order approximation, we can conclude that the number of messages contained in all $K$ channels is approximately
$$K \cdot \frac{r}{D} \cdot D = K r. \qquad (9)$$
This means that, contrary to intuitive expectation, the average number of requests from a given processor contained in the network does not grow with the size of the network. The reason is that, due to the necessity to support transit, the granted throughput of each outgoing channel falls in inverse proportion to D while the average latency is proportional to it. Note that this fact is invariant to the degree of spatial locality, since variation of the locality scale leads to the recalculation of the average route length, D, which does not affect the figure.
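The first-order estimate above is an application of Little's law: for each channel, the number of a node's own messages in flight is its injection rate multiplied by the latency. A minimal numerical sketch follows; the function name and the sample values K = 4 and r = 0.5 are illustrative assumptions, not figures from our analysis, and queueing is ignored as in the text.

```python
def in_flight_requests(K, r, D, hop_time=1.0):
    """First-order estimate of one node's requests present in the network.

    K: output channels per router
    r: observed traffic rate per channel (messages per network cycle)
    D: average hop count of a request-reply round trip
    """
    own_rate_per_channel = r / D      # only 1/D of outgoing traffic is the node's own
    latency = D * hop_time            # D hops, queueing delays ignored at this stage
    return K * own_rate_per_channel * latency   # Little's law, summed over K channels

# The estimate is the same for small and large networks (D cancels out).
for D in (4, 16, 64):
    print(D, in_flight_requests(K=4, r=0.5, D=D))
```

The factor D enters once through the injection rate and once through the latency, so it cancels, which is the size independence noted in the text.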
In order to find the actual form of the function $L(G)$ for the network, we have to take into account the queueing times of both the internal queue and the transit queues.
More accurate analysis
The model of an output channel is shown in Fig. 9. Each message spends some time $L_q$ in the internal queue at the original node and then performs an average of $D$ hops. Each hop takes time $L_0$; as we have chosen the network cycle for the time unit, $L_0 = 1$. On its way, the message passes an average of $D - 1$ transit nodes and spends some time $L_t$ in the transit queue in each of them. Therefore, the total latency is given by the following formula:
$$L(G) = L_q + D L_0 + (D - 1) L_t.$$
The transit queue is served with deterministic service rate $T_0$ which, in the chosen units of time, is of course equal to 1: one hop at each network cycle. Thus, a non-zero transit queue length can be produced only by the collision of several messages queueing at the same output channel. Unfolding the instantaneous queueing process in discrete time, we observe that we are effectively dealing with the average number of waiting customers in a system with deterministic service rate (see [17]). Taking into account the discrete nature of the process, we obtain
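The composition of the total latency from its three parts can be written down directly. The sketch below takes the two queueing delays as inputs rather than computing them, since their closed forms depend on the queueing analysis; the function name is a hypothetical one chosen for illustration.

```python
def total_latency(D, L_q, L_t, L_0=1.0):
    """Total request latency in network cycles.

    L_q: time spent in the internal queue at the originating node
    D:   average number of hops, each taking L_0
    L_t: time spent in the transit queue at each of the D - 1 transit nodes
    """
    return L_q + D * L_0 + (D - 1) * L_t

# With empty queues the latency is just the hop count.
print(total_latency(D=8, L_q=0.0, L_t=0.0))   # 8.0
# Half a cycle of waiting at every transit node adds (D - 1) / 2 cycles.
print(total_latency(D=8, L_q=2.0, L_t=0.5))   # 13.5
```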
Note that the network traffic rate $r$ and the granted throughput $G$ have an obvious relationship, in which $T_0$ is the physical throughput of one channel (this formula does not depend on the choice of the unit of time).
Combination of memory and network
$L(G)$ for the combination of memory and network (which is the architecture of contemporary massively parallel computers) depends on the degree of spatial locality; the two systems considered above correspond to full locality and zero locality, respectively.
Numerical solutions
Functions $P(n)$ corresponding to the results of solving equations (7) and (14) numerically are presented in Fig. 11 and Fig. 12, respectively. The first of these figures corresponds to conventional memory, the second to nearly any feasible symmetric network (note that for the network the number of threads is expressed per bidirectional channel in a router). For very small networks the graphs for $P(n)$ deviate from those shown in Fig. 12 due to the impact of terms of the order of $(D - 1)/D$, but generally $P(n)$ has very little to do with $D$, as was suggested by the approximate formula (9).
Figure 12: $P(n)$ for network
As expected, in our contiguous model the asymptotic performance ($1$ for $R \le T_{max}$ and $T_{max}/R$ for $R > T_{max}$) is possible only with an infinite number of threads. However, the 80% level with respect to the theoretical maximum is achieved with just a few threads for the conventional memory and with about 2 threads per channel for the network.
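The asymptotic performance quoted above, and the 80% level measured against it, can be expressed as a small helper. This is a sketch only: $R$ is read here as the rate at which the processor would issue memory requests if it never stalled, which is an assumption about notation, and the function names are hypothetical.

```python
def asymptotic_performance(R, T_max):
    """Upper bound on efficiency with unlimited threads.

    R:     request rate of a never-stalling processor (assumed meaning of R)
    T_max: maximum sustained throughput of the memory system
    """
    if R <= 0:
        raise ValueError("R must be positive")
    return min(1.0, T_max / R)

def eighty_percent_level(R, T_max):
    """Efficiency corresponding to 80% of the theoretical maximum."""
    return 0.8 * asymptotic_performance(R, T_max)

# A processor asking for twice what the memory can deliver is bounded at 0.5.
print(asymptotic_performance(R=4.0, T_max=2.0))
```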
The conclusion that can be drawn from these results is that using heavy-weight multithreading for tolerating latency is not justified. The limited number of threads that is sufficient for fairly efficient execution should apparently be drawn from one referential environment with common registers and a common stack, which will provide for low-overhead management of threads.
A similar conclusion was made by Culler in [7]. However, his analysis of the maximum number of outstanding requests is based on the technological limitations of existing network-based memory systems. Our "80% at 2 threads per channel" figure has nothing to do with those limitations. It is a fundamental implication of the fact that the majority of random uniform traffic is in transit.
In section 2 we have introduced mechanisms which can be applied to conventional RISC architectures to allow micro-threading, i.e. multi-threading within a single context. We have also shown that the cost of the primitives required is small and that we can expect to obtain a substantial fraction of peak performance with surprisingly few threads. However, there are a number of issues which may pose problems when compared to conventional designs, and these are discussed below together with some possible solutions and future developments.
Spatial locality of instruction stream
The rotation of several threads in the pipeline obviously damages the spatial locality of instruction cache access. However, the effects of this in micro-threading are less likely to be observed than when threading on larger contexts. One straightforward solution is to use a fast, associative level 0 cache or refill buffer that can keep track of several points of control. However, when we miss all levels of instruction cache memory, a long stall is unavoidable. In the dynamic scheduling model we can go further. Since the hardware is not obliged to take PCs from the continuation queue in any strict order, several instructions from different threads can be fetched simultaneously and issued at will. This does however complicate the level 0 cache design.
Interrupts and context switching
Interrupt hardware in this architecture is more complex. To be able to stop the pipeline instantly, we require a recovery mechanism that can both track down the vertical PC issued by a speculatively executed instruction and keep the list of threads being executed. There is a range of simpler solutions affecting the interrupt response time and the complexity of interrupt processing.
The continuation queue and the register tags put an additional burden on context switching. The tags can be organised as a separate memory structure that can be accessed in two ways: horizontally (together with a register) and vertically (32 tags per word).
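The two access modes of the tag memory can be illustrated with a bit-packed model. The sketch below assumes a 32-register file and a single tag bit per register packed into one 32-bit word; both the class name and the one-bit tag width are assumptions made for illustration, not a description of the hardware.

```python
class TagStore:
    """Register tags packed into one word: bit i is the tag of register i."""

    NUM_REGS = 32
    MASK = (1 << NUM_REGS) - 1

    def __init__(self):
        self.word = 0

    # Horizontal access: one tag, read or written together with its register.
    def read_tag(self, reg):
        return (self.word >> reg) & 1

    def write_tag(self, reg, tag):
        if tag:
            self.word |= (1 << reg)
        else:
            self.word &= ~(1 << reg)

    # Vertical access: all 32 tags moved as a single word on a context switch.
    def save(self):
        return self.word & self.MASK

    def restore(self, word):
        self.word = word & self.MASK


tags = TagStore()
tags.write_tag(3, 1)
saved = tags.save()        # one word captures every tag of the old context
tags.restore(0)            # the incoming context starts with clear tags
tags.restore(saved)        # switching back restores all tags in one access
```

With the vertical access path, saving or restoring the tags costs one memory word per context switch instead of one access per register.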
Overcoming a temporary narrowing of parallelism
When the available parallelism is dynamically not large enough to tolerate the internal pipe delay, we can apply a number of different strategies in order to perform at least no worse than the conventional RISC pipeline.
The simplest solution is to replace a vertical transfer of control by a horizontal one if the continuation queue is empty. In many cases, it is better to stall on a possible data dependency.
In the most complex solution, any PC, no matter whether it is issued vertically or horizontally, is forwarded immediately to the parallel multi-fetching logic, which is shared by all the processing pipelines. The instruction which can be processed with the shortest stall time is issued first. Conditional branches are predicted only if postponing the decision damages the overall performance (i.e. there is a lack of parallelism). A full-power back-trace recovery mechanism has to be provided if this approach is adopted. This mechanism prevents an instruction from taking any unrecoverable action until the direct ancestor of the instruction has reached the retirement point. Such a mechanism is feasible, but keeping it out of the critical paths is a non-trivial design challenge.
Code generation
In the conventional memory case, it is enough to have 3-4 threads available for execution at any given moment (see Fig. 11). This is clearly possible in principle for the vast majority of applications: existing code generators for conventional RISC processors successfully extract about the same number of statically-scheduled threads from one referential context in order to compensate for pipeline delays (otherwise RISC architecture would not be viable). With dynamic scheduling, the task of identifying independent threads becomes even easier, but special care must be taken in order to avoid having too many active threads at one moment and too few at another. Too many threads will impact performance by loss of locality. One key issue we will be investigating is the choice of scheduling strategy for ready threads (other than first come, first served), with a code generation logic geared towards providing the right kind of thread mix for the scheduler.
When a microthreaded processor is used with a network-based memory system, the number of threads required for maintaining reasonable efficiency is higher: see Fig. 12. In this case, an obvious source of threads is provided by data parallelism, where each data-parallel expression is effectively a generic template of scalar threads that share the referential context. Invariably, the number of actual threads generated will be limited by the resources available, i.e. by the number of registers. Therefore, microthreaded processors designed for distributed computing should provide enough registers to make it possible for the code generator to satisfy the "two threads per channel" requirement.
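The register requirement can be made concrete with simple arithmetic. The sketch below is illustrative only: the number of registers a thread needs and the number shared by all threads are hypothetical parameters that would in practice be decided by the code generator.

```python
def max_threads(num_registers, regs_per_thread, shared_regs=0):
    """Threads that fit in the register file after reserving shared registers."""
    return (num_registers - shared_regs) // regs_per_thread

def registers_needed(K, regs_per_thread, threads_per_channel=2, shared_regs=0):
    """Register file size that supports the required threads for K channels."""
    return shared_regs + K * threads_per_channel * regs_per_thread

# A router with K = 4 channels needs 8 threads; at 4 registers per thread
# (an assumed figure) that is 32 registers, plus any shared ones.
print(registers_needed(K=4, regs_per_thread=4))
print(max_threads(num_registers=32, regs_per_thread=4))
```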
Conclusions
This paper has introduced and justified novel architectural techniques for micro-threading, which we define as multi-threading within a single context. This solution has an extremely small cost in terms of additional cycles required and can be implemented over conventional RISC designs with few modifications to the instruction set. An analysis has shown that this solution to latency tolerance is quite viable in both conventional and network-based memory systems, due to the small number of threads required. The results of this analysis are surprisingly insensitive to architectural parameters because they are predicated on two fundamental facts, namely the exponential fall of the probability of stalling with the number of threads and the proportion of transit traffic found in a router node. These first-order effects both contribute to the small number of threads required to reach a substantial fraction of the maximum possible performance.
This work has the potential to affect the complete range of architectures used commercially today. In cheap PC-based systems, where little or no cache is used, concurrency may be used to mask the relatively high frequency of accesses to main memory and thereby mitigate any performance loss. This is also the case where cache is available but the nature of the problem or algorithm means that locality is very difficult to find. At the other end of the scale, micro-tasking provides an architectural means to tolerate the latency and dispersion found in network-based shared memory solutions. We have shown in our work on the compilation of data-parallel languages [5, 14] that we can exploit the parametric parallelism found in this style in generating, and if necessary throttling, the large number of threads that this programming paradigm yields.
