734 research outputs found

    A low-complexity, high-performance fetch unit for simultaneous multithreading processors

    Get PDF
    Simultaneous multithreading (SMT) is an architectural technique that allows for the parallel execution of several threads simultaneously. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies to deal with fetch priority among threads, trying to increase fetch performance. We demonstrate that the simultaneous sharing of the fetch unit, apart from increasing the complexity of the fetch unit, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture proposal allows us to feed an 8-way processor fetching from a single thread each cycle, reducing complexity, and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our results show that our design obtains better average performance for any kind of workloads (both ILP and memory bounded benchmarks), in contrast to previously proposed solutions.Peer ReviewedPostprint (published version

    Improving latency tolerance of multithreading through decoupling

    Get PDF
    The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.Peer ReviewedPostprint (published version

    The synergy of multithreading and access/execute decoupling

    Get PDF
    This work presents and evaluates a novel processor microarchitecture which combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how both techniques complement each other: while decoupling features an excellent memory latency hiding efficiency, multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. Its partitioned layout, together with its in-order issue policy makes it potentially less complex, in terms of critical path delays, than a centralized out-of-order design, to support future growths in issue-width and clock speed. The simulations show that by adding decoupling to a multithreaded architecture, its miss latency tolerance is sharply increased and in addition, it needs fewer threads to achieve maximum throughput, especially for a large miss latency. Fewer threads result in a hardware complexity reduction and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck.Peer ReviewedPostprint (published version

    Efficient resources assignment schemes for clustered multithreaded processors

    Get PDF
    New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version

    Improving multithreading performance for clustered VLIW architectures.

    Get PDF
    Very Long Instruction Word (VLIW) processors are very popular in embedded and mobile computing domain. Use of VLIW processors range from Digital Signal Processors (DSPs) found in a plethora of communication and multimedia devices to Graphics Processing Units (GPUs) used in gaming and high performance computing devices. The advantage of VLIWs is their low complexity and low power design which enable high performance at a low cost. Scalability of VLIWs is limited by the scalability of register file ports. It is not viable to have a VLIW processor with a single large register file because of area and power consumption implications of the register file. Clustered VLIW solve the register file scalability issue by partitioning the register file into multiple clusters and a set of functional units that are attached to register file of that cluster. Using a clustered approach, higher issue width can be achieved while keeping the cost of register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a larger set of applications. Many of these applications have a good Lnstruction Level Parallelism (ILP) which can be efficiently utilized. However, several applications, specially the ones that are control code dominated do not exibit good ILP and the processor is underutilized. Cache misses is another major source of resource underutiliztion. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle but cannot hide unused instructions slots. Simultaneous MultiThread (SMT) can also remove ILP under-utilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT) that supports a limited form of SMT where VLIW instructions from different threads are merged at a cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at the individual operation-level granularity which is quite expensive especially in for a mobile solution. We refer to SMT at operation-level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has a better scalability and upto 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, Cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing the issue-slots under-utilization and significantly improves CSMT performance.The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads, heterogeneous instruction merging where some instructions are combined using SMT and CSMT rest, and finally, split-issue, a technique that allows to launch partially an instruction making it easier to be combined with others

    Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors

    Full text link
    Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci

    Designing with RoBs for High Performance VLIW Architecture

    Get PDF
    VLIW architecture has become widespread due to the combined bene?ts of simple hardware and compiler extracted instruction level parallelism. The VLIW instruction set architecture and its hardware implementation is tightly coupled and a novel simultaneous multithreading VLIW architecture with dynamic dispatch mechanism which uses RoBs complex logic to maximize ILP has been proposed. Since the resulting dynamic instruction schedule of many applications seldom changes, it is reasonable to store and reuse the schedule instead of reconstructing it each time. The new VLIW architecture shows that it can effectively increase the processor efficiency which improves the performance

    An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor

    Get PDF
    Over the years, the von Neumann model of computing has undergone many enhancements. These changes include an improved memory hierarchy, multiple instruction issue and branch predic tion. Since the model\u27s introduction, the performance of processors has increased at a much greater rate than that of memory. Several modifications to hide this ever widening gap in performance are being examined in current research. A very promising one is the Simultaneous Multithreaded processor. This architecture strives to further reduce the effects of long latency instructions, such as memory accesses, by allowing multiple threads of execution to be active in the processor at the same time. With the introduction of multiple active threads in a single processor, several new aspects of processor operation can have a sizeable effect on performance. One such aspect is how to choose from which thread to fetch instructions during the next cycle. For this project, three different classes of fetch scheduling mechanisms were defined and exam ples of each were either studied or proposed. The proposed mechanisms were then tested using a set of four sample programs by adding the mechanisms to a Simultaneous Multithreading sim ulator based on the Simple Scalar tool set from the University of Wisconsin-Madison. With the proper configuration, each of the proposed mechanisms improved the performance of the simulated architecture. However, the best increase in performance was produced by the Event History Table. It achieved an IPC of 2.0995 for two threads while overriding the primary scheduling mechanism only 0.070% of the time
    corecore