Abstract. This paper describes how a superscalar in-order processor must be modified to support Simultaneous Multithreading (SMT) such that time-predictability is preserved for hard real-time applications. For superscalar in-order architectures the calculation of the Worst Case Execution Time (WCET) is much easier and tighter than for out-of-order architectures. By a careful enhancement that completely isolates the threads, this capability can be carried over to an in-order SMT architecture. Our design goal is to minimise the WCET of the highest priority thread, while releasing as many resources as possible for the execution of concurrent non-critical threads. The resultant processor executes hard real-time threads at the same speed as its singlethreaded ancestor, but idle issue slots are dynamically used by non-critical threads. The modifications to enable SMT include a multithreaded fetch stage, an additional real-time issue stage, a wider register set, a prioritised multithreaded memory interface, split phase loads and interruptible microcodes for multi-cycle operations. The application of these enhancements is demonstrated by CarCore, a multithreaded embedded processor that implements the Infineon TriCore instruction set.
Introduction
The common way to construct a simultaneous multithreaded (SMT) processor is to take a superscalar out-of-order processor and allow it to fetch from multiple threads [1]. This procedure is simple: the fetch stage must be modified and the number of registers increased, but only minor modifications to the internal logic of the pipeline are needed. Despite its simplicity, this combination greatly improves processor throughput [1].
Out-of-order SMT processors have two drawbacks, however: they consume a lot of chip area and energy, and their Worst Case Execution Time (WCET) is hard to predict because processor resources are allocated dynamically. This makes this kind of SMT unsuitable for hard real-time applications.
We eliminate these drawbacks by taking an in-order superscalar processor as base architecture for an alternative implementation of SMT, called In-Order Simultaneous Multithreading. The Intel Atom processor [2] is a well-known representative of this class of SMT processors, although it only benefits from a smaller transistor count and lower energy consumption than comparable out-of-order SMT processors. In contrast, this paper addresses an advantage that is even more important for embedded systems: because of the deterministic behaviour of superscalar in-order processors, they allow for tight WCET analyses. By adding strictly prioritised multithreading capabilities that completely isolate threads from each other, the tight WCETs can be preserved, while the utilisation and energy-efficiency of the processor are increased by concurrent threads.
The contributions of this paper are:
- An in-order SMT processor that isolates the highest priority thread (HPT).
- The HPT is executed as if the underlying processor was a singlethreaded superscalar processor to keep the WCET analysis tight.
- A detailed description of how a superscalar in-order pipeline must be modified to enable SMT.
- A prototype of an SMT processor with TriCore instruction set architecture.
The rest of the paper is organised as follows: the next section presents the related work and section 3 explains the TriCore architecture and the differences to the baseline singlethreaded CarCore processor. In section 4 the enhancements to enable SMT are described in detail. Section 5 discusses our evaluation results and section 6 concludes the paper with some future work.
Related Work
Tullsen [1] defined SMT as multithreading for superscalar pipelines. He did not specify the execution order, but he used an out-of-order processor as base architecture and so did most of the later SMT researchers. Consequently, most of the work on real-time and SMT is also based on out-of-order pipelines [3] [4] [5] [6]. But the unpredictability of out-of-order pipelines does not allow hard real-time execution; only soft real-time scheduling is addressed.
Although Hily [7] already showed in 1998 that in-order SMT increases total throughput, while out-of-order execution only boosts one single thread and is less cost-effective, only a few studies focus on designing SMT processors with in-order pipelines [8]. Similar results were published by Moon [9], who found that static partitioning and in-order execution have only a small negative effect on performance while significantly reducing design complexity. Other studies that divide the pipeline into an out-of-order front-end and an in-order back-end [10] or that restrict certain parts of the pipeline to in-order execution [11] confirmed the advantages of in-order execution.
Zang et al. [8] investigated issue mechanisms for in-order SMT processors. Their processor has a 7 stage pipeline and can issue up to 6 instructions from 6 concurrent threads. A well-known commercial processor that has an in-order SMT architecture is the Intel Atom processor [2] with a two-way in-order pipeline. But none of the mentioned works addresses hard real-time execution. To our knowledge, our project is the first on hard real-time for in-order SMT processors.
The Real-time Virtual Multiprocessor (RVMP) [12] issues multiple instructions from multiple threads to multiple pipelines, but it assumes multiple identical pipelines and statically maps threads to pipelines. Therefore multiple hard real-time threads can be executed, but the throughput is not increased, as idle pipeline slots cannot be used dynamically by other threads.
The Precision Timed (PRET) Architecture [13] is another example of a hard real-time capable multithreaded processor, but again the schedule is very static: there are 6 threads and they are executed in fixed order, hence every thread gets exactly one sixth of the execution time. If a thread is stalled, the cycle cannot be used by another thread. As the PRET architecture supports only precisely timed hard real-time threads, no other threads with softer timing demands can be executed to increase throughput.
Baseline
As an example, we use a TriCore-compatible processor to present the SMT enhancements, but they can easily be transferred to other superscalar in-order architectures. TriCore-specific parts are explicitly marked.
TriCore Architecture
The Infineon TriCore [14] is a microcontroller that is commonly used in safety-critical applications of the automotive industry. It combines a real-time capable load-store microcontroller architecture with DSP instructions. The instruction set comprises more than 700 instructions. Besides the common arithmetic, logic, branch and load-store instructions it provides instructions for sophisticated logic, context saving, load-modify-store, packed arithmetic, saturated math and multiply-accumulate. The processor consists of a three-way superscalar in-order pipeline with four stages. If an address, an integer, and a loop instruction appear in this order in the instruction stream, they are issued within one cycle, even if they are data-dependent.
Simplifications for Single-Threaded CarCore
As baseline for the SMT enhancement we implemented a cycle-accurate SystemC model and a synthesisable VHDL model of a TriCore-compatible processor. It differs from the original Infineon TriCore in the following aspects:
Reduced Instruction Set. Special DSP instructions and addressing modes are generated neither by the Hightec [15] nor by the Tasking [16] compiler, therefore they are not supported (we do not apply hand-coded assembler optimisations to the benchmarks). This reduces the number of instructions to 433.
No Loop Pipeline. The TriCore loop pipeline that speeds up special loops is not implemented, as the compiler can use it only in a few special cases and with multithreading the latencies can be used to execute concurrent threads.
No Branch Prediction. There is no branch prediction, as it degrades the WCET and the delay slots can be used to execute lower priority threads.
No Dedicated Context Save Memory. According to the TriCore instruction set architecture, 16 registers have to be saved at a subroutine call. To speed this up, the TriCore has a special memory area with a very wide bus for context saving. In our version we use the standard memory bus, therefore a function call is about ten times slower.
Later Address Calculation. CarCore calculates branch target and memory access addresses in the execute stage, one stage later than TriCore. Hence the branch and memory delay slots are increased by one, but the critical path of the very complex and slow decode stage is shortened, resulting in a higher overall clock rate.
SMT Enhancements
To enable multithreading, some parts of the processor must be duplicated for every thread: the register set, the program counter and the instruction window where fetched instructions are buffered before issuing. To manage these instruction windows and to decide which instruction from which thread should be issued to which pipeline, a further pipeline stage is added, the Real-Time Issue (RTI) stage (section 4.2). The fetch stage (section 4.1) demands special attention to guarantee that lower priority threads do not delay the hard real-time capable highest priority thread. The same applies to the memory controller (section 4.3), which should issue memory accesses in the same prioritised order in which the real-time issue stage issues instructions. The other pipeline stages can remain unchanged, apart from an additional signal that passes the thread number through the pipeline so that the write back stage writes to the appropriate register file. Fig. 1 shows the resulting architecture.
Instruction Fetch
The execution of an instruction typically occupies one pipeline stage for only one cycle, and issuing multiple instructions from one thread is only reasonable if there are enough instructions available. Hence the number of instructions that is fetched per cycle must be equal to or greater than the number of instructions that can be issued concurrently. As the number of concurrent instructions is equal to the number of pipelines, this number must be multiplied by the maximum instruction length to get the required fetch bandwidth.
Assuming a zero cycle memory latency, it takes two cycles from the decision that a new fetch must be initiated (in the issue stage) until the arrival of the data at the instruction window (again in the issue stage). Therefore the instruction windows (IW) must be large enough to hold at least two times the fetch width. Each additional cycle of memory latency further increases the size of the IW by the fetch width. If the instruction width varies and the instructions are not aligned to the borders of the fetch words, the size must be further increased.
A concrete example with the CarCore architecture: There are two pipelines and an instruction can be 16 or 32 bits wide. Accordingly, fetching 64 bits should provide at least enough instructions for one cycle. If in cycle t + 0 the RTI issues two instructions to the pipelines, it removes at most 64 bits from the IW, recognises that it should be refilled and initiates a fetch. During cycle t + 1 the memory is accessed and the RTI must take the next 64 bits from the IW. In cycle t + 2 the fetched data arrives at the RTI, so the data can be directly issued to the pipelines. But 128 bits are still not enough, as TriCore instructions need only be aligned to 16 bit boundaries. Consequently four instructions could span three 64 bit words and the minimum IW size is 192 bits.
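The sizing argument can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the rule that unaligned instructions cost exactly one extra fetch word is an assumption generalised from the CarCore example, not a formula stated for arbitrary architectures.

```python
def fetch_width_bits(pipelines: int, max_instr_bits: int) -> int:
    """Bits to fetch per cycle so that every pipeline can receive one instruction."""
    return pipelines * max_instr_bits


def min_iw_bits(fetch_bits: int, fetch_latency: int = 0, unaligned: bool = False) -> int:
    """Minimum instruction window size in bits.

    Two fetch words cover the two-cycle refill loop, every cycle of memory
    latency adds one more word, and (assumption taken from the CarCore
    example) unaligned instructions cost one additional fetch word.
    """
    words = 2 + fetch_latency + (1 if unaligned else 0)
    return words * fetch_bits


# CarCore: two pipelines, instructions of up to 32 bits, 16 bit alignment.
carcore_fetch = fetch_width_bits(pipelines=2, max_instr_bits=32)
carcore_iw = min_iw_bits(carcore_fetch, fetch_latency=0, unaligned=True)
print(carcore_fetch, carcore_iw)  # 64 192
```

With aligned instructions the same call yields the intermediate value of 128 bits mentioned in the example.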
With the proposed fetch width and instruction window size, optimal execution of the highest priority thread (HPT) can be guaranteed. But what about the other threads? The HPT only fully occupies the fetch stage if there is code that uses every pipeline in every cycle and if these instructions are of maximum length. As the evaluation shows, this is almost never the case. Whenever the IW of a thread is full, the fetch logic tries to fetch for the thread with the next highest priority. Again, the evaluation shows that empty IWs are only a minor reason for not executing a lower priority thread.
There are two possibilities to optimise the fetching. The first one, called ENOUGH, counts exactly how many instructions are in the IW and how they are mapped to pipelines. If there are enough instructions to cover two cycles, further fetches for this thread are delayed, no matter if the IW is already full or not. The second one, called AHEAD, stops fetching when it recognises a branch somewhere within the IW. This optimisation is only applicable if there is no branch prediction and if there are at least two pipeline stages between fetch and branch decision.
An example for the CarCore architecture with three stages between fetch and branch decision (RTI, decode and execute): If in cycle t + 0 only the branch is in the IW, the RTI issues it and removes the instruction from the IW. In cycle t + 1 the AHEAD logic recognises that there is no longer a branch in the IW and permits to fetch the next instruction. In cycle t + 2 the next instruction word arrives at the RTI and is ready for issuing, while the branch instruction is now in the execute stage, calculating the branch target and checking the branch condition. In cycle t + 3 the RTI receives the signal, if the branch is taken or not and can issue the next instruction if it is not taken. The next instruction arrives even one cycle early at the RTI, but this is necessary in the CarCore architecture, as a single instruction could span two 64 bit words and then the cycle is needed for the second half of the instruction.
Algorithm 1 Policy of the Real Time Issue Stage
Input: number of threads T, number of pipelines P; thread_0 has the highest priority, thread_(T-1) the lowest
Output: assignment of instructions to pipelines in pipeline_p
pipeline_p ← ∅ ∀ 0 ≤ p < P
Both techniques save unused fetches and therefore increase the fetch bandwidth of lower priority threads without influencing the HPT performance. They are effective if the instruction length varies or only some of the pipelines are occupied within one cycle. Whether such extra effort is reasonable depends heavily on the complexity of the instruction encoding and the size of the hardware that is needed to decode it.
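The two fetch policies can be illustrated by a small decision function. The following Python sketch is not the CarCore fetch logic: `Instr`, `complete_cycles` and `may_fetch` are hypothetical names, and the sketch assumes that instructions carry a pipeline number and that a cycle's worth of instructions ends where the pipeline number stops increasing (the rule of the real-time issue stage). As a conservative assumption, a trailing group only counts as a covered cycle if it already uses the last pipeline, because a later fetch might still extend it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    pipeline: int            # pipeline number taken from the opcode field
    is_branch: bool = False  # set for branch instructions


def complete_cycles(iw: List[Instr], num_pipelines: int) -> int:
    """Count the issue cycles that the window can certainly feed."""
    cycles = 0
    for i, ins in enumerate(iw):
        if i + 1 < len(iw):
            # The group ends if the next pipeline number does not increase.
            closed = iw[i + 1].pipeline <= ins.pipeline
        else:
            # The last group is only complete if it ends in the last pipeline.
            closed = ins.pipeline == num_pipelines - 1
        if closed:
            cycles += 1
    return cycles


def may_fetch(iw: List[Instr], iw_full: bool, num_pipelines: int,
              enough: bool = False, ahead: bool = False) -> bool:
    """Decide whether the fetch stage should fetch for this thread."""
    if iw_full:
        return False
    if enough and complete_cycles(iw, num_pipelines) >= 2:
        return False  # ENOUGH: two cycles are already covered
    if ahead and any(ins.is_branch for ins in iw):
        return False  # AHEAD: a branch is pending in the window
    return True


window = [Instr(0), Instr(1), Instr(0), Instr(1)]
print(may_fetch(window, iw_full=False, num_pipelines=2, enough=True))  # False
```

Whenever `may_fetch` returns False for a thread, the fetch slot can be offered to the thread with the next highest priority.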
Real-Time Issue
The real-time issue (RTI) stage receives the fetched instructions from the fetch stage and inserts them into the instruction window of the appropriate thread. Then the instructions in the windows are analysed and instructions are assigned to pipelines. To decide in which pipeline an instruction should be executed, the opcode contains a field where the number of the appropriate pipeline is stored. Instructions that could be issued in parallel must be located in ascending order within the instruction stream. As long as the value of the pipeline field of the next instruction is higher than that of the preceding one, it can be issued concurrently. When the pipeline number of an instruction is lower than or equal to the number of the preceding instruction, this instruction is the first instruction of the next cycle.
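The grouping rule can be written down directly. The sketch below (the function name is hypothetical) takes the pipeline fields of one thread's instructions in program order and splits them into the groups that may be issued in the same cycle.

```python
from typing import List


def split_issue_groups(pipeline_fields: List[int]) -> List[List[int]]:
    """Split a stream of pipeline numbers into per-cycle issue groups.

    An instruction joins the current group only if its pipeline number is
    strictly higher than that of the preceding instruction.
    """
    groups: List[List[int]] = []
    for field in pipeline_fields:
        if groups and field > groups[-1][-1]:
            groups[-1].append(field)
        else:
            groups.append([field])
    return groups


print(split_issue_groups([0, 1, 1, 0, 0, 1]))  # [[0, 1], [1], [0], [0, 1]]
```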
The assignment strictly depends on the priorities of every thread. Starting with the thread with the highest priority, the RTI tries to issue simultaneously as many instructions as possible. Then instructions from the thread with the second highest priority are issued, if the desired pipeline is not occupied yet. Algorithm 1 explains the issue strategy in pseudo code.
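A minimal Python sketch of this strictly prioritised assignment follows. It is an illustration under stated assumptions, not a transcription of Algorithm 1: the function name is hypothetical, each instruction window is reduced to a list of pipeline numbers, and a thread is assumed to stop issuing at its first instruction whose pipeline is already occupied, since execution is in-order.

```python
from typing import List, Optional


def rti_issue(windows: List[List[int]], num_pipelines: int) -> List[Optional[int]]:
    """Assign instructions to pipelines for one cycle.

    windows[t] holds the pipeline numbers of the pending instructions of
    thread t, oldest first; thread 0 has the highest priority. Issued
    instructions are removed from their window. The result names, for every
    pipeline, the thread that issued into it (None if the slot stays idle).
    """
    pipeline_owner: List[Optional[int]] = [None] * num_pipelines
    for thread, window in enumerate(windows):
        last_field = -1
        issued = 0
        for field in window:
            if field <= last_field:
                break  # belongs to the next cycle of this thread
            if pipeline_owner[field] is not None:
                break  # pipeline taken by a higher priority thread
            pipeline_owner[field] = thread
            last_field = field
            issued += 1
        del window[:issued]
    return pipeline_owner


windows = [[1, 0], [0, 1], [0]]
print(rti_issue(windows, num_pipelines=2))  # [1, 0]
print(windows)                              # [[0], [1], [0]]
```

In the example the highest priority thread can only use pipeline 1 in this cycle, so the idle slot of pipeline 0 is filled by thread 1, while thread 2 finds its desired pipeline occupied and issues nothing.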
Additionally the RTI manages multicycle instructions. They are implemented as microcode sequences that can be interrupted at any position within the sequence. This interruptibility is important, otherwise low priority threads would delay higher priority threads for several cycles, once they were able to start their microcode sequence.
In this paper we assume that the priorities are fixed, but it is also possible to provide a new priority mapping from an external module in each cycle. With this technique, sophisticated scheduling algorithms for multiple hard real-time threads [17] or overlapping IPC controlled threads [18] can be implemented.
Prioritised Memory Controller
The easiest way to deal with memory accesses is to add a memory stage in the pipeline between execute and write-back stage, as it is implemented in the classical DLX pipeline [19]. But then there might be a read-after-write data dependency after a load instruction that cannot be solved by forwarding. Depending on the instruction set, the check whether there really is a dependency can be difficult (for TriCore it is), therefore stalling the thread for one cycle anyway (and using the cycle for a lower priority thread) is an acceptable solution. After a store, no bubble cycle needs to be inserted.
In TriCore the bubble cycle is avoided by calculating the address in the decode stage and accessing memory in the execute stage, but this cannot be applied here, as CarCore calculates the address in the execute stage (to achieve better stage balance, see section 3.2). If a memory access takes multiple cycles, say it has a latency of M, single threaded processors stall the complete pipeline until the access is completed. This cannot be applied here either, as it would prevent all other threads from being executed, even the highest priority thread.
A straightforward extension of the memory stage idea would be to add M memory stages (plus the one memory stage mentioned earlier). To avoid data dependencies, the thread must be stalled for M + 1 cycles before the next instruction of the same thread might be issued. But there is one more problem: if two memory instructions of different threads are issued in successive cycles, the second memory instruction will arrive at the memory controller when it is busy because of the first instruction. Consequently, after the RTI issued a memory operation, no other memory operation from any thread may be issued for M cycles.
To save the enormous hardware costs of multiple memory stages, we applied a technique called Split Phase Load. For a memory write the additional memory stages are not needed, as nothing must be written back; only the stalling of the threads is important. Hence only a solution for loads must be found: the load is split into two instructions, the address calculation and the register write back. When the RTI recognises a load instruction, it issues the first half of the instruction that calculates the address in the execute stage and forwards it to the memory controller. When the memory controller receives the data, it notifies the RTI, which then issues the second half of the instruction that writes the data from the memory to the register set in the write back stage. The notification can even be sent some cycles earlier so that the instruction arrives at the write back stage in the same cycle as the data arrives from memory.
To lift the restriction that the other threads may not issue memory operations, Address Buffers are added. There is one address buffer per thread located in the memory controller. After a store or the first half of a load the thread is temporarily suspended from further issuing instructions. When the memory instruction arrives at the memory controller, the address is saved in the address buffer of the appropriate thread. Whenever a memory operation is completed, the memory controller looks at the address buffers in priority order and starts a memory operation if there is a valid entry. In the same cycle the memory controller notifies the RTI to resume issuing instructions of the thread whose data word has just arrived. Depending on the kind of instruction, the RTI continues with the second half of a load instruction or the next instruction after a store.
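The behaviour of the address buffers can be illustrated with a small cycle-based model. This Python sketch is a simplification under stated assumptions: the class and method names are hypothetical, all accesses take the same fixed latency of at least one cycle, the early notification of the RTI is not modelled, and loads and stores are not distinguished.

```python
from typing import List, Optional


class PrioritisedMemoryController:
    """One address buffer per thread; thread 0 has the highest priority."""

    def __init__(self, num_threads: int, latency: int) -> None:
        assert latency >= 1  # assumption of this sketch
        self.buffers: List[Optional[int]] = [None] * num_threads
        self.latency = latency
        self.active: Optional[int] = None  # thread whose access is running
        self.remaining = 0

    def submit(self, thread: int, address: int) -> None:
        # The thread is suspended after issuing, so its buffer must be free.
        assert self.buffers[thread] is None
        self.buffers[thread] = address

    def tick(self) -> Optional[int]:
        """Advance one cycle; return the thread the RTI may resume, if any."""
        finished = None
        if self.active is not None:
            self.remaining -= 1
            if self.remaining == 0:
                finished, self.active = self.active, None
        if self.active is None:
            # Scan the address buffers in priority order.
            for thread, address in enumerate(self.buffers):
                if address is not None:
                    self.buffers[thread] = None
                    self.active, self.remaining = thread, self.latency
                    break
        return finished


mc = PrioritisedMemoryController(num_threads=3, latency=2)
mc.submit(2, 0x100)      # a low priority access arrives first
resumed = [mc.tick()]    # and is started
mc.submit(0, 0x200)      # HPT and thread 1 arrive while the memory is busy
mc.submit(1, 0x300)
resumed += [mc.tick() for _ in range(6)]
print([t for t in resumed if t is not None])  # [2, 0, 1]
```

The waiting accesses are served in priority order, but an access that has already started is not interrupted, so the highest priority thread has to wait for the running access of thread 2 in this example.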
There is another advantage of the Split Phase Load / Address Buffer technique: the memory latency can vary and there is no upper bound. If the memory access is fast, the second half of the load is issued earlier, if it takes longer, the write back instruction (respectively the next instruction after a store) is issued later. So multiple memories with different access times are supported.
Evaluation
We started our SMT enhancement with a singlethreaded SystemC model of the CarCore and enhanced it to support multithreading. The final SystemC model was translated to VHDL for FPGA synthesis. There are separate data and instruction memory buses, each 64 bits wide. In the FPGA model the memory latencies are fixed to 0 cycles for the instruction memory (internal on-chip RAM) and 4 cycles for the off-chip data memory. We used the Hightec GNU C/C++ compiler for TriCore [15] to compile benchmark programs from the EEMBC AutoBench 1.1 benchmark suite [20] (a2time, canrdr, aifirf, rspeed) and the Mälardalen WCET group [21] (crc, fft1, mm). When executing multiple threads, the HPT reaches exactly 100% of its singlethreaded speed, hence a WCET analysis for a singlethreaded simplification of our architecture is also valid for the HPT in the multithreaded architecture [17]. The speed of the threads with lower priorities falls exponentially to about 50, 35 and 20 percent of single threaded performance (measured in Instructions Per Cycle, IPC). For a more detailed discussion see [17].
Reasons for stalling threads
The reasons why a thread cannot issue any instructions in a certain cycle can be divided into five classes:
- branch: fixed latency of a branch (2 cycles).
- memfix: minimum latency of a memory access (3 cycles).
- membusy: additional stall cycles when a memory instruction cannot be executed because the memory is busy with an operation from another thread.
- pipeline: the desired pipeline is already occupied by a higher priority thread.
- fetch: the instruction window is empty.
Fig. 3 shows the distribution of the reasons for not issuing instructions of the highest priority thread (HPT), depending on the memory latencies. The x-axis gives the reason for a delay and the numbers on the x-axis indicate the memory latencies: the first number is the latency of the instruction memory, the second one the latency of the data memory. Even with minimum latencies (0/0), more than 80% of unused cycles are due to memory delays. Not surprisingly, the percentage increases when the latencies are increased. Each additional cycle of instruction memory latency increases the percentage of fetch stalls by 8%, therefore a fast instruction connection (via scratchpad or instruction cache) is indispensable. The bars marked with plus show additional membusy latencies. These appear when the HPT is executed together with other threads (for the bars without a plus, only a single thread was executed). If a lower priority memory access takes multiple cycles and begins in the cycle preceding the cycle when an HPT memory access should start, the former occupies the memory controller for multiple cycles and therefore delays the HPT.
This effect violates the complete isolation (and thus the hard real-time capability of the HPT). It can be handled in two ways: (i) lower priority memory accesses are delayed when an HPT memory access is on the way through the pipeline (then the reason distribution is the same as in the corresponding case without the plus), or (ii) the WCET analysis is modified to assume twice the memory latency. Both solutions are discussed in detail in [17].
We executed 1000 task-sets of 8 threads that were randomly assembled from the above mentioned benchmarks. Fig. 4 shows the average distribution of stall reasons against the priority. The fixed latencies (which dominate the stall reasons of the HPT) are less important for lower priority threads. For these threads, the influence of fetch and pipeline conflicts grows significantly. If the memory latency is more than zero (rightmost group of Fig. 4), this stall reason dominates for lower priority threads.
Instruction Fetch Optimisation
To decrease the fetch conflicts and hence increase the performance of lower priority threads, we introduced the AHEAD and the ENOUGH fetch policies. They reduce the number of fetches of the HPT without affecting its real-time behaviour. Fig. 5 shows the percentage of fetch cycles when executing the benchmarks singlethreaded on the CarCore. The percentages vary significantly depending on the benchmark. Both optimised policies reduce the number of fetches by about 5 to 10 percent. As the average of the benchmarks shows, ENOUGH is on average better than AHEAD, but not for every benchmark. Very interesting is the combination of both policies: their sets of eliminated fetches are nearly disjoint, hence it is not surprising that the numbers of eliminated fetches nearly add up when both are combined. But for some benchmarks their cooperation is even better: because of a complicated interaction (the explanation goes beyond the scope of this paper) the savings of the combination are even bigger than the sum of the single savings.
Conclusion and Future Work
We explained how a singlethreaded superscalar TriCore-compatible processor can be enhanced to provide SMT while still allowing the execution of one hard real-time thread with several non real-time threads concurrently in the background. The techniques described can easily be transferred to any other superscalar in-order processor. The latency of the memory is the main reason for stalling threads and thus the biggest problem of the architecture. Currently our group is integrating scratchpad memory into the CarCore to ease this problem. First results are available in [22]. The split phase load is not optimal for small latencies, as it has a fixed minimum latency (3 cycles in CarCore). Therefore the performance might be improved by a Speculative Split Phase: if the load accesses the scratchpad memory, it is handled like a one cycle instruction and the data is written to the register file directly in the write-back stage. Otherwise (if the access is to the slow memory) the thread is stalled until the second phase of the load instruction writes the data to the register file.
