
    Design of Clustered Superscalar Microarchitectures

    The objective of this thesis is to propose new techniques for the design of efficient clustered superscalar microarchitectures. Clustered microarchitectures partition the layout of several critical hardware components as a means to keep most of the parallelism while improving scalability. A clustered processor core, made up of several low-complexity blocks or clusters, can efficiently execute chains of dependent instructions without paying the overheads of long issue, register read, or bypass latencies; of course, when two dependent instructions execute in different clusters, an inter-cluster communication penalty is incurred. Moreover, distributed structures usually imply lower dynamic power requirements and simplify power management via techniques such as selective clock/power gating and voltage scaling.

    The first target of this research is the assignment of instructions to clusters, since it plays a major role in performance, with the goals of keeping the workload of the clusters balanced and reducing the penalty of critical communications. Two different approaches are proposed: first, a family of new schemes that dynamically identify groups of data-dependent instructions called slices and make cluster assignments on a per-slice basis. The proposed schemes differ from previous approaches either because they are dynamic and/or because they include new mechanisms to deal explicitly with workload-balance information gathered at runtime. Second, a family of new dynamic schemes that assign instructions to clusters on a per-instruction basis, based on the prior assignments of the source register producers, on the cluster location of the source physical registers, and on the workload of the clusters.

    The second contribution proposes value prediction as a means to mitigate the penalties of wire delays and, in particular, to hide inter-cluster communications while also improving workload balance. First, it is shown that the benefit of breaking dependences with value prediction grows with the number of clusters and the communication latency, and is thus higher than for a centralized architecture. Second, a cluster assignment scheme is proposed that exploits the less dense data-dependence graph that results from predicting values to achieve a better workload balance.

    The third aspect considered is the cluster interconnect, which mainly determines the communication latency, seeking the best trade-off between cost and performance. First, several cost-effective point-to-point interconnects are proposed, both synchronous and partially asynchronous, that approach the IPC of an ideal model with unlimited bandwidth while keeping the complexity low. The proposed interconnects have a much lower impact than other approaches on the complexity of bypasses, issue queues, and register files. Second, possible router implementations are proposed, which illustrate their feasibility with very simple and low-latency hardware solutions. Third, a new topology-aware improvement to the cluster assignment scheme is proposed to reduce the distance (and latency) of inter-cluster communications.

    The last contribution proposes techniques for distributing the main components of the processor front-end, with the goals of reducing their complexity and avoiding replication. In particular, effective techniques are proposed to cluster the branch predictor and the instruction steering logic, minimizing the wire-delay penalties caused by the recursive dependences in two critical hardware loops: the fetch address generation and the cluster assignment logic, respectively. In the former case, the proposed technique converts the cross-structure wire delays of a centralized predictor into cross-cluster communication delays, which are smoothly pipelined. In the latter case, partitioning the instruction steering logic involves parallelizing an inherently sequential task, the dependence-based cluster assignment of instructions.
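    To make the per-instruction steering idea concrete, the following is a minimal sketch of a dependence- and load-aware steering heuristic in the spirit of the schemes summarized above. The cluster count, the imbalance limit, and the function and parameter names (steer, producer_cluster, pending) are illustrative assumptions, not the thesis' exact algorithms.

        # Illustrative sketch only: parameters and policy are assumptions.
        from collections import defaultdict

        N_CLUSTERS = 4            # assumed cluster count
        IMBALANCE_LIMIT = 8       # assumed maximum tolerated occupancy gap

        def steer(source_regs, producer_cluster, pending):
            """Pick a cluster for one renamed instruction.

            source_regs      -- source register names of the instruction
            producer_cluster -- dict: register -> cluster that produces it
            pending          -- dict: cluster -> current queue occupancy
            """
            # Vote for the clusters that already hold the source operands,
            # to avoid paying inter-cluster communication penalties.
            votes = defaultdict(int)
            for reg in source_regs:
                if reg in producer_cluster:
                    votes[producer_cluster[reg]] += 1

            least_loaded = min(range(N_CLUSTERS), key=lambda c: pending[c])
            if votes:
                best = max(votes, key=lambda c: (votes[c], -pending[c]))
                # Follow the dependences unless that would unbalance the
                # workload beyond the tolerated limit.
                if pending[best] - pending[least_loaded] <= IMBALANCE_LIMIT:
                    return best
            return least_loaded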

    Using MCD-DVS for dynamic thermal management performance improvement

    With chip temperature being a major hurdle in microprocessor design, techniques to recover the performance lost to thermal emergency mechanisms are crucial to sustain performance growth. Many power-reduction techniques in the past, and some more recent work on thermal management, have helped alleviate this problem. Probably the most important thermal control technique is dynamic voltage and frequency scaling (DVS), which allows an almost cubic reduction in power while the worst-case performance penalty is only linear. So far, DVS techniques for temperature control have been studied at the chip level. Finer-grain DVS is feasible if a globally asynchronous, locally synchronous (GALS) design style is employed. GALS, also known as multiple clock domain (MCD), allows independent voltage and frequency control for each of the clock domains on the chip. Several studies apply DVS to GALS designs to improve energy and power efficiency, but not temperature. This paper proposes and analyses the use of DVS at the domain level to control temperature in a clustered MCD microarchitecture, with the goal of improving the performance of applications that do not meet the thermal constraints imposed by the designers.
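    As a rough illustration of domain-level DVS for thermal control, the sketch below lowers the voltage/frequency level of any clock domain that exceeds a thermal threshold and restores it once the domain cools. The DVS levels, thresholds, and function names are assumptions made for the sketch; they are not the controller evaluated in the paper.

        # Illustrative per-domain DVS thermal controller (all values assumed).
        DVS_LEVELS = [(1.2, 3.0), (1.1, 2.6), (1.0, 2.2), (0.9, 1.8)]  # (volts, GHz)
        T_EMERGENCY = 85.0   # throttle a domain above this temperature (deg C)
        T_SAFE = 80.0        # let a domain speed back up below this temperature

        def adjust_domain(level, temperature):
            """Return the new DVS level index for a single clock domain."""
            if temperature > T_EMERGENCY and level < len(DVS_LEVELS) - 1:
                return level + 1   # lower voltage/frequency for this domain only
            if temperature < T_SAFE and level > 0:
                return level - 1   # recover performance once the domain cools
            return level

        def adjust_all(levels, temperatures):
            """Each domain is tuned independently, so a hot cluster can be
            slowed down without throttling the rest of the chip."""
            return [adjust_domain(l, t) for l, t in zip(levels, temperatures)]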

    Empowering a helper cluster through data-width aware instruction selection policies

    Narrow values, which can be represented with fewer bits than the full machine width, occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. These attributes can be combined synergistically to design special clusters that operate on narrow values (a.k.a. a helper cluster), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. Then, as our main focus, we propose various ideas for selecting suitable instructions to execute in the data-width-based clusters. We add data-width information as another instruction-steering decision metric and introduce new data-width-based selection algorithms that also consider dependences, inter-cluster communication, and load imbalance. Using these techniques, the performance of a wide range of workloads is substantially increased; the helper cluster achieves an average speedup of 11% across a wide range of 412 applications. When focusing on integer applications, the average speedup can be as high as 22%.
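    The sketch below illustrates one possible data-width-based selection policy in the spirit of the abstract: an instruction is sent to the 8-bit helper cluster only if its (predicted) operands are narrow, its producers favor the helper cluster, and the load imbalance stays bounded. The prediction interface, thresholds, and names are assumptions, not the paper's algorithms.

        # Hypothetical data-width-aware steering sketch.
        NARROW_BITS = 8   # datapath width of the assumed helper cluster

        def is_narrow(value):
            """True if a signed value fits in the helper cluster's datapath."""
            return -(1 << (NARROW_BITS - 1)) <= value < (1 << (NARROW_BITS - 1))

        def steer_to_helper(predicted_operands, producers_in_helper,
                            helper_load, main_load, max_imbalance=16):
            """Decide whether one instruction goes to the helper cluster."""
            if not all(is_narrow(v) for v in predicted_operands):
                return False        # wide values must use the 32-bit core
            if helper_load - main_load > max_imbalance:
                return False        # avoid overloading the helper cluster
            # Prefer the helper cluster when its producers already execute
            # there, so no inter-cluster communication is needed.
            return producers_in_helper or helper_load <= main_load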

    Inherently workload-balanced clustered microarchitecture

    The performance of clustered microarchitectures relies on steering schemes that try to find the best trade-off between workload balance and inter-cluster communication penalties. In previously proposed clustered processors, reducing communication penalties and balancing the workload are opposing targets, since improving one usually implies a detriment to the other. In this paper we propose a new clustered microarchitecture that can minimize communication penalties without compromising workload balance. The key idea is to arrange the clusters in a ring topology so that the results of one cluster can be forwarded to the neighboring cluster with a very short latency. In this way, communication penalties are minimized when the producer of a value and its consumer are placed in adjacent clusters, which also favors workload balance. The proposed microarchitecture is shown to outperform a state-of-the-art clustered processor: for instance, for an 8-cluster configuration and just one fully pipelined unidirectional bus, a 15% speedup is achieved on average for FP programs.
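    A minimal sketch of ring-aware steering, assuming a unidirectional ring and one-hop neighbor forwarding as described above; the queue capacity and function names are illustrative assumptions.

        # Illustrative ring-topology steering sketch (parameters assumed).
        N_CLUSTERS = 8    # ring size, as in the 8-cluster configuration above
        WINDOW_SIZE = 16  # assumed per-cluster issue-queue capacity

        def steer(producer_clusters, occupancy):
            """Place an instruction in its producer's cluster, or in the next
            cluster along the ring if that one is full."""
            if not producer_clusters:
                return min(range(N_CLUSTERS), key=lambda c: occupancy[c])
            home = producer_clusters[-1]   # cluster of the most recent producer
            # Walk forward along the unidirectional ring until a cluster has
            # room; each hop adds only the short neighbor-forwarding latency.
            for hop in range(N_CLUSTERS):
                candidate = (home + hop) % N_CLUSTERS
                if occupancy[candidate] < WINDOW_SIZE:
                    return candidate
            return home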

    Thermal-aware clustered microarchitectures

    As frequencies and feature sizes scale faster than operating voltages, power density increases with each processor generation, and the cost of removing the heat it generates increases at the same rate. Leakage increases significantly with every process generation and is expected to be the main source of power in the near future. Moreover, leakage power grows exponentially with temperature. This paper proposes and evaluates several techniques with two goals: reducing average temperature in order to decrease leakage power, and reducing peak temperature in order to lower cooling cost. Combinations of temperature-aware steering techniques and cluster hopping are investigated in a quad-cluster superscalar microarchitecture. Combining cluster hopping with a temperature-aware steering policy results in a 30% reduction in leakage power and an 8% reduction in average peak temperature, at the expense of a slowdown of just 5%.
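    The sketch below combines the two ideas in a simplified form: temperature-aware steering breaks steering ties in favor of the coolest cluster, and cluster hopping periodically swaps the hottest active cluster for the coolest idle one. Thresholds, the interval policy, and names are assumptions rather than the paper's exact mechanisms.

        # Simplified temperature-aware steering plus cluster hopping.
        N_CLUSTERS = 4      # quad-cluster configuration, as in the abstract
        T_HOT = 80.0        # assumed hopping threshold (deg C)

        def temperature_aware_steer(candidates, temperatures):
            """Among otherwise equivalent candidate clusters, pick the coolest."""
            return min(candidates, key=lambda c: temperatures[c])

        def cluster_hop(active, temperatures):
            """Migrate activity away from the hottest active cluster so it can
            cool down (and leak less), waking up the coolest idle cluster."""
            hottest = max(active, key=lambda c: temperatures[c])
            idle = [c for c in range(N_CLUSTERS) if c not in active]
            if idle and temperatures[hottest] > T_HOT:
                coolest_idle = min(idle, key=lambda c: temperatures[c])
                active = [c for c in active if c != hottest] + [coolest_idle]
            return active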

    Dynamically managing the communication-parallelism trade-off in future clustered processors

    Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow more clusters, and thereby more aggressive exploitation of instruction-level parallelism (ILP), inter-cluster communication increases as data values get spread across a wider area. As a result of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic-block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication-parallelism trade-off inherent in the communication-bound processors of the future.
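    A simplified sketch of interval-based reconfiguration, assuming an explore-then-exploit policy over a fixed set of cluster counts; the candidate configurations, the metric (IPC), and the class interface are assumptions, not the paper's algorithm.

        # Sketch of interval-based tuning of the number of active clusters.
        CONFIGS = [2, 4, 8, 16]   # assumed candidate numbers of active clusters

        class Reconfigurator:
            """Measure each configuration once, then keep the best one (a real
            implementation would also restart exploration on phase changes)."""

            def __init__(self):
                self.measured_ipc = {}

            def next_config(self, last_config, ipc):
                """Called at the end of each interval with the IPC observed."""
                self.measured_ipc[last_config] = ipc
                remaining = [c for c in CONFIGS if c not in self.measured_ipc]
                if remaining:                   # exploration phase
                    return remaining[0]
                # Exploitation: run with the best configuration seen so far.
                return max(self.measured_ipc, key=self.measured_ipc.get)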

    Efficient interconnects for clustered microarchitectures

    Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered microarchitectures. This new class of interconnects has different demands and characteristics than traditional multiprocessor networks. In a clustered microarchitecture, a low inter-cluster communication latency is essential for high performance. We propose point-to-point interconnects together with an effective latency-aware instruction steering scheme and show that they achieve much better performance than bus-based interconnects. The results show that the connectivity of the network together with latency-aware steering schemes are key for high performance. We also show that these interconnects can be built with simple hardware and achieve a performance close to that of an idealized contention-free model.
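    As a rough illustration of latency-aware steering over a point-to-point interconnect, the sketch below scores each candidate cluster by the hop count needed to gather its source operands plus its current load, assuming a unidirectional ring; the weights and names are assumptions.

        # Hypothetical latency-aware steering sketch (topology and weights assumed).
        N_CLUSTERS = 4

        def hops(src, dst):
            """Hop count between clusters on an assumed unidirectional ring."""
            return (dst - src) % N_CLUSTERS

        def steer(source_clusters, load, comm_weight=2.0, load_weight=1.0):
            """Choose the cluster minimizing estimated communication + load cost."""
            def cost(c):
                comm = sum(hops(s, c) for s in source_clusters)
                return comm_weight * comm + load_weight * load[c]
            return min(range(N_CLUSTERS), key=cost)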
