1,193 research outputs found

    Early register release for out-of-order processors with register windows

    Get PDF
    Register windows is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper we propose two early register release techniques that leverages register windows to drastically reduce the register requirements, and hence reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the same number as logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Efficient resources assignment schemes for clustered multithreaded processors

    Get PDF
    New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version

    Virtual-physical registers

    Get PDF
    A novel dynamic register renaming approach is proposed in this work. The key idea of the novel scheme is to delay the allocation of physical registers until a late stage in the pipeline, instead of doing it in the decode stage as conventional schemes do. In this way, the register pressure is reduced and the processor can exploit more instruction-level parallelism. Delaying the allocation of physical registers require some additional artifact to keep track of dependences. This is achieved by introducing the concept of virtual-physical registers, which do not require any storage location and are used to identify dependences among instructions that have not yet allocated a register to its destination operand. Two alternative allocation strategies have been investigated that differ in the stage where physical registers are allocated: issue or write-back. The experimental evaluation has confirmed the higher performance of the latter alternative. We have performed all evaluation of the novel scheme through a detailed simulation of a dynamically scheduled processor. The results show a significant improvement (e.g., 19% increase in IPC for a machine with 64 physical registers in each file) when compared with the traditional register renaming approach.Peer ReviewedPostprint (published version

    Leveraging register windows to reduce physical registers to the bare minimum

    Get PDF
    Register window is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverage register windows to drastically reduce the register requirements, and hence, reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    A software-hardware hybrid steering mechanism for clustered microarchitectures

    Get PDF
    Clustered microarchitectures provide a promising paradigm to solve or alleviate the problems of increasing microprocessor complexity and wire delays. High- performance out-of-order processors rely on hardware-only steering mechanisms to achieve balanced workload distribution among clusters. However, the additional steering logic results in a significant increase on complexity, which actually decreases the benefits of the clustered design. In this paper, we address this complexity issue and present a novel software-hardware hybrid steering mechanism for out-of-order processors. The proposed software- hardware cooperative scheme makes use of the concept of virtual clusters. Instructions are distributed to virtual clusters at compile time using static properties of the program such as data dependences. Then, at runtime, virtual clusters are mapped into physical clusters by considering workload information. Experiments using SPEC CPU2000 benchmarks show that our hybrid approach can achieve almost the same performance as a state-of-the-art hardware-only steering scheme, while requiring low hardware complexity. In addition, the proposed mechanism outperforms state-of-the-art software-only steering mechanisms by 5% and 10% on average for 2-cluster and 4-cluster machines, respectively.Peer ReviewedPostprint (published version

    Improving latency tolerance of multithreading through decoupling

    Get PDF
    The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.Peer ReviewedPostprint (published version

    MOWER : A NEW DESIGN FOR NON-BLOCKING MISPREDICTION RECOVERY

    Get PDF
    Mower is a micro-architecture technique which targets branch misprediction penalties in superscalar processors. It speeds-up the misprediction recovery process by dynamically evicting stale instructions and fixing the RAT (Register Alias Table) using explicit branch dependency tracking. Tracking branch dependencies is accomplished by using simple bit matrices. This low-overhead technique allows overlapping of the recovery process with instruction fetching, renaming and scheduling from the correct path. Our evaluation of the mechanism indicates that it yields performance very close to ideal recovery and provides up to 5% speed-up and 2% reduction in power consumption compared to a traditional recovery mechanism using a reorder buffer and a walker. The simplicity of the mechanism should permit easy implementation of Mower in an actual processor

    Dynamic Dependency Collapsing

    Get PDF
    In this dissertation, we explore the concept of dynamic dependency collapsing. Performance increases in computer architecture are always introduced by exploiting additional parallelism when the clock speed is fixed. We show that further improvements are possible even when the available parallelism in programs are exhausted. This performance improvement is possible due to executing instructions in parallel that would ordinarily have been serialized. We call this concept dependency collapsing. We explore existing techniques that exploit parallelism and show which of them fall under the umbrella of dependency collapsing. We then introduce two dependency collapsing techniques of our own. The first technique collapses data dependencies by executing two normally dependent instructions together by fusing them. We show that exploiting the additional parallelism generated by collapsing these dependencies results in a performance increase. Our second technique collapses resource dependencies to execute instructions that would normally have been serialized due to resource constraints in the processor. We show that it is possible to take advantage of larger in-processor structures while avoiding the power and area penalty this often implies

    Adaptable register file organization for vector processors

    Get PDF
    Today there are two main vector processors design trends. On the one hand, we have vector processors designed for long vectors lengths such as the SX-Aurora TSUBASA which implements vector lengths of 256 elements (16384-bits). On the other hand, we have vector processors designed for short vectors such as the Fujitsu A64FX that implements vector lengths of 8 elements (512-bit) ARM SVE. However, short vector designs are the most widely adopted in modern chips. This is because, to achieve high-performance with a very high-efficiency, applications executed on long vector designs must feature abundant DLP, then limiting the range of applications. On the contrary, short vector designs are compatible with a larger range of applications. In fact, in the beginnings, long vector length implementations were focused on the HPC market, while short vector length implementations were conceived to improve performance in multimedia tasks. However, those short vector length extensions have evolved to better fit the needs of modern applications. In that sense, we believe that this compatibility with a large range of applications featuring high, medium and low DLP is one of the main reasons behind the trend of building parallel machines with short vectors. Short vector designs are area efficient and are "compatible" with applications having long vectors; however, these short vector architectures are not as efficient as longer vector designs when executing high DLP code. In this thesis, we propose a novel vector architecture that combines the area and resource efficiency characterizing short vector processors with the ability to handle large DLP applications, as allowed in long vector architectures. In this context, we present AVA, an Adaptable Vector Architecture designed for short vectors (MVL = 16 elements), capable of reconfiguring the MVL when executing applications with abundant DLP, achieving performance comparable to designs for long vectors. The design is based on three complementary concepts. First, a two-stage renaming unit based on a new type of registers termed as Virtual Vector Registers (VVRs), which are an intermediate mapping between the conventional logical and the physical and memory registers. In the first stage, logical registers are renamed to VVRs, while in the second stage, VVRs are renamed to physical registers. Second, a two-level VRF, that supports 64 VVRs whose MVL can be configured from 16 to 128 elements. The first level corresponds to the VVRs mapped in the physical registers held in the 8KB Physical Vector Register File (P-VRF), while the second level represents the VVRs mapped in memory registers held in the Memory Vector Register File (M-VRF). While the baseline configuration (MVL=16 elements) holds all the VVRs in the P-VRF, larger MVL configurations hold a subset of the total VVRs in the P-VRF, and map the remaining part in the M-VRF. Third, we propose a novel two-stage vector issue unit. In the first stage, the second level of mapping between the VVRs and physical registers is performed, while issuing to execute is managed in the second stage. This thesis also presents a set of tools for designing and evaluating vector architectures. First, a parameterizable vector architecture model implemented on the gem5 simulator to evaluate novel ideas on vector architectures. Second, a Vector Architecture model implemented on the McPAT framework to evaluate power and area metrics. Finally, the RiVEC benchmark suite, a collection of ten vectorized applications from different domains focusing on benchmarking vector microarchitectures.Hoy en día existen dos tendencias principales en el diseño de procesadores vectoriales. Por un lado, tenemos procesadores vectoriales basados en vectores largos como el SX-Aurora TSUBASA que implementa vectores con 256 elementos (16384-bits) de longitud. Por otro lado, tenemos procesadores vectoriales basados en vectores cortos como el Fujitsu A64FX que implementa vectores de 8 elementos (512-bits) de longitud ARM SVE. Sin embargo, los diseños de vectores cortos son los más adoptados en los chips modernos. Esto es porque, para lograr alto rendimiento con muy alta eficiencia, las aplicaciones ejecutadas en diseños de vectores largos deben presentar abundante paralelismo a nivel de datos (DLP), lo que limita el rango de aplicaciones. Por el contrario, los diseños de vectores cortos son compatibles con un rango más amplio de aplicaciones. En sus orígenes, implementaciones basadas en vectores largos estaban enfocadas al HPC, mientras que las implementaciones basadas en vectores cortos estaban enfocadas en tareas de multimedia. Sin embargo, esas extensiones basadas en vectores cortos han evolucionado para adaptarse mejor a las necesidades de las aplicaciones modernas. Creemos que esta compatibilidad con un mayor rango de aplicaciones es una de las principales razones de construir máquinas paralelas basadas en vectores cortos. Los diseños de vectores cortos son eficientes en área y son compatibles con aplicaciones que soportan vectores largos; sin embargo, estos diseños de vectores cortos no son tan eficientes como los diseños de vectores largos cuando se ejecuta un código con alto DLP. En esta tesis, proponemos una novedosa arquitectura vectorial que combina la eficiencia de área y recursos que caracteriza a los procesadores vectoriales basados en vectores cortos, con la capacidad de mejorar en rendimiento cuando se presentan aplicaciones con alto DLP, como lo permiten las arquitecturas vectoriales basadas en vectores largos. En este contexto, presentamos AVA, una Arquitectura Vectorial Adaptable basada en vectores cortos (MVL = 16 elementos), capaz de reconfigurar el MVL al ejecutar aplicaciones con abundante DLP, logrando un rendimiento comparable a diseños basados en vectores largos. El diseño se basa en tres conceptos. Primero, una unidad de renombrado de dos etapas basada en un nuevo tipo de registros denominados registros vectoriales virtuales (VVR), que son un mapeo intermedio entre los registros lógicos y físicos y de memoria. En la primera etapa, los registros lógicos se renombran a VVR, mientras que, en la segunda etapa, los VVR se renombran a registros físicos. En segundo lugar, un VRF de dos niveles, que admite 64 VVR cuyo MVL se puede configurar de 16 a 128 elementos. El primer nivel corresponde a los VVR mapeados en los registros físicos contenidos en el banco de registros vectoriales físico (P-VRF) de 8 KB, mientras que el segundo nivel representa los VVR mapeados en los registros de memoria contenidos en el banco de registros vectoriales de memoria (M-VRF). Mientras que la configuración de referencia (MVL=16 elementos) contiene todos los VVR en el P-VRF, las configuraciones de MVL más largos contienen un subconjunto del total de VVR en el P-VRF y mapean la parte restante en el M-VRF. En tercer lugar, proponemos una novedosa unidad de colas de emisión de dos etapas. En la primera etapa se realiza el segundo nivel de mapeo entre los VVR y los registros físicos, mientras que en la segunda etapa se gestiona la emisión de instrucciones a ejecutar. Esta tesis también presenta un conjunto de herramientas para diseñar y evaluar arquitecturas vectoriales. Primero, un modelo de arquitectura vectorial parametrizable implementado en el simulador gem5 para evaluar novedosas ideas. Segundo, un modelo de arquitectura vectorial implementado en McPAT para evaluar las métricas de potencia y área. Finalmente, presentamos RiVEC, una colección de diez aplicaciones vectorizadas enfocadas en evaluar arquitecturas vectorialesPostprint (published version
    corecore