
    Empowering a helper cluster through data-width aware instruction selection policies

    Narrow values, which can be represented with fewer bits than the full machine width, occur very frequently in programs. Clustering mechanisms, in turn, enable cost- and performance-effective scaling of processor back-end features. These two attributes can be combined synergistically to design special clusters that operate on narrow values (a.k.a. helper clusters), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. Then, as our main focus, we propose various ideas for selecting suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms that also consider dependency, inter-cluster communication, and load imbalance. With these techniques, the performance of a wide range of workloads is substantially increased: the helper cluster achieves an average speedup of 11% across a wide range of 412 apps. When focusing on integer applications, the speedup can be as high as 22% on average. Peer Reviewed. Postprint (published version).
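The data-width steering idea in this abstract can be illustrated with a minimal sketch. It assumes operand values are already known when the steering decision is made (a real design would more likely rely on a width prediction), and the names `is_narrow`, `steer`, and `max_imbalance` are hypothetical, not taken from the paper.

```python
def is_narrow(value: int, narrow_bits: int = 8, full_bits: int = 32) -> bool:
    """Return True if a full_bits-wide two's-complement value fits in narrow_bits."""
    value &= (1 << full_bits) - 1            # keep only the machine-width bits
    if value >= 1 << (full_bits - 1):        # reinterpret as a signed number
        value -= 1 << full_bits
    limit = 1 << (narrow_bits - 1)
    return -limit <= value < limit


def steer(operands, helper_load: int, main_load: int, max_imbalance: int = 4) -> str:
    """Pick a cluster for one instruction (hypothetical heuristic).

    The instruction goes to the 8-bit helper cluster only if every operand is
    narrow and the helper is not already much busier than the main cluster.
    """
    all_narrow = all(is_narrow(v) for v in operands)
    if all_narrow and helper_load - main_load < max_imbalance:
        return "helper"
    return "main"


if __name__ == "__main__":
    print(steer([3, 0xFFFFFFFF], helper_load=1, main_load=2))  # both operands narrow
    print(steer([3, 300], helper_load=1, main_load=2))         # 300 needs more than 8 bits
```

The load check stands in for the load-imbalance term mentioned in the abstract; dependency and inter-cluster communication costs are left out of this sketch.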

    Inherently workload-balanced clustered microarchitecture

    The performance of clustered microarchitectures relies on steering schemes that try to find the best trade-off between workload balance and inter-cluster communication penalties. In previously proposed clustered processors, reducing communication penalties and balancing the workload are opposite targets, since improving one usually implies a detriment in the other. In this paper we propose a new clustered microarchitecture that can minimize communication penalties without compromising workload balance. The key idea is to arrange the clusters in a ring topology in such a way that results of one cluster can be forwarded to the neighbor cluster with a very short latency. In this way, minimizing communication penalties is favored when the producer of a value and its consumer are placed in adjacent clusters, which also favors workload balance. The proposed microarchitecture is shown to outperform a state-of-the-art clustered processor. For instance, for an 8-cluster configuration and just one fully pipelined unidirectional bus, 15% speedup is achieved on average for FP programs. Peer Reviewed. Postprint (published version).
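The ring arrangement can be pictured with a small sketch. It assumes a unidirectional ring in which cluster `i` forwards results to cluster `i + 1` (wrapping around). The helper names `ring_hops` and `steer_consumer` are hypothetical, and the paper's actual steering scheme is not described in the abstract.

```python
def ring_hops(producer: int, consumer: int, n_clusters: int) -> int:
    """Number of forwarding hops from producer to consumer on a unidirectional ring."""
    return (consumer - producer) % n_clusters


def steer_consumer(producer: int, n_clusters: int) -> int:
    """Place a consumer in the cluster adjacent to its producer (one hop away)."""
    return (producer + 1) % n_clusters


if __name__ == "__main__":
    n = 8
    cluster = 0
    chain = [cluster]
    # A chain of dependent instructions rotates around the ring, so it is
    # spread over all clusters while every communication stays one hop long.
    for _ in range(9):
        cluster = steer_consumer(cluster, n)
        chain.append(cluster)
    print(chain)
```

Under this assumption, the placement that keeps communication short is also the one that spreads a dependence chain across clusters, which is the property the abstract highlights.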

    On-Chip Transparent Wire Pipelining (invited paper)

    Wire pipelining has been proposed as a viable means to address the discrepancy between decreasing gate delays and increasing wire delays in deep-submicron technologies. Far from being a straightforwardly applicable technique, this methodology requires a number of design modifications in order to fit seamlessly into the current design flow. In this paper we briefly survey the methods presented by other researchers in the field, and then we thoroughly analyze the solutions we recently proposed, ranging from system-level wire pipelining to physical design aspects.

    Design of Clustered Superscalar Microarchitectures

    The objective of this thesis is to propose new techniques for designing efficient clustered superscalar microarchitectures. Clustered microarchitectures partition the layout of several critical hardware components as a means to keep most of the parallelism while improving scalability. A clustered processor core, made up of several low-complexity blocks or clusters, can efficiently execute chains of dependent instructions without paying the overheads of long issue, register read, or bypass latencies. However, when two dependent instructions execute in different clusters, an inter-cluster communication penalty is incurred. Moreover, distributed structures usually imply lower dynamic power requirements and simplify power management via techniques such as selective clock/power gating and voltage scaling.
    The first target of this research is the assignment of instructions to clusters, since it plays a major role in performance, with the goals of keeping the workload of clusters balanced and reducing the penalty of critical communications. Two different approaches are proposed. The first is a family of new schemes that dynamically identify groups of data-dependent instructions called slices and make cluster assignments on a per-slice basis. These schemes differ from previous approaches either because they are dynamic and/or because they include new mechanisms that deal explicitly with workload balance information gathered at runtime. The second is a family of new dynamic schemes that assign instructions to clusters on a per-instruction basis, based on the prior assignment of the source register producers, on the cluster location of the source physical registers, and on the workload of clusters.
    The second contribution proposes value prediction as a means to mitigate the penalties of wire delays and, in particular, to hide inter-cluster communications while also improving workload balance. First, it is shown that the benefit of breaking dependences with value prediction grows with the number of clusters and the communication latency, and is therefore higher than for a centralized architecture. Second, a cluster assignment scheme is proposed that exploits the less dense data dependence graph resulting from value prediction to achieve a better workload balance.
    The third aspect considered is the cluster interconnect, which largely determines communication latency, with the aim of finding the best trade-off between cost and performance. First, several cost-effective point-to-point interconnects are proposed, both synchronous and partially asynchronous, that approach the IPC of an ideal model with unlimited bandwidth while keeping complexity low. The proposed interconnects have a much lower impact than other approaches on the complexity of bypasses, issue queues, and register files. Second, possible router implementations are proposed, which illustrate their feasibility with very simple, low-latency hardware solutions. Third, a new topology-aware improvement to the cluster assignment scheme is proposed to reduce the distance (and latency) of inter-cluster communications.
    The last contribution proposes techniques for distributing the main components of the processor front-end, with the goals of reducing their complexity and avoiding replication. In particular, effective techniques are proposed to cluster the branch predictor and the steering logic so as to minimize the wire delay penalties caused by broadcasting recursive dependences in two critical hardware loops: the fetch address generation and the cluster assignment logic, respectively. In the former case, the proposed technique converts the cross-structure wire delays of a centralized predictor into cross-cluster communication delays, which are smoothly pipelined. In the latter case, partitioning the instruction steering logic involves parallelizing an inherently sequential task, namely the dependence-based cluster assignment of instructions.

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This combination of shrinking size and rising performance demands a lower supply voltage and effective management of power dissipation in chips with millions of transistors. It has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, and particularly for the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor core level, but it also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, cache, and pipelining, can be optimized to reduce the power consumption of the processor, and this paper discusses various ways in which they can be optimized. Emerging power-efficient processor architectures are also reviewed, and research activities are discussed that should help the reader identify how these factors contribute to power consumption in a processor. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    Re-visiting the performance impact of microarchitectural floorplanning

    Journal Article. The placement of microarchitectural blocks on a die can significantly impact operating temperature. A floorplan that is optimized for low temperature can negatively impact performance by introducing wire delays between critical pipeline stages. In this paper, we identify subsets of wire delays that can and cannot be tolerated. These subsets are different from those identified by prior work. This paper also makes the case that floorplanning algorithms must consider the impact of floorplans on bypassing complexity and instruction replay mechanisms.

    Microarchitectural techniques to reduce interconnect power in clustered processors

    Journal Article. The paper presents a preliminary evaluation of novel techniques that address a growing problem: power dissipation in on-chip interconnects. Recent studies have shown that around 50% of the dynamic power consumption in modern processors is within on-chip interconnects. The contribution of interconnect power to total chip power is expected to be higher in future communication-bound billion-transistor architectures. In this paper, we propose the design of a heterogeneous interconnect, where some wires are optimized for low latency and others are optimized for low power. We show that a large fraction of on-chip communications are latency insensitive. Effecting these non-critical transfers on low-power long-latency interconnects can result in significant power savings without unduly affecting performance. Two primary techniques are evaluated in this paper: (i) a dynamic critical path predictor that identifies results that are not urgently consumed, and (ii) an address prediction mechanism that requires addresses to be transferred off the critical path for verification purposes. Our results demonstrate that 49% of all interconnect transfers can be effected on power-efficient wires, while incurring a performance penalty of only 2.5%.

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in the usage of cache sets. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at group granularity (a group comprises several cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is then placed in the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, our evaluations show that CE outperforms related CMP cache designs.
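The placement rule in this abstract (send an incoming block to the group with the minimum recorded pressure) can be sketched as follows. The class name `PressureAwarePlacer`, its methods, and the use of a simple counter as the pressure metric are all assumptions made for illustration; the abstract does not say how CE measures or stores pressure.

```python
class PressureAwarePlacer:
    """Toy model: track per-group pressure and place blocks in the least-pressured group."""

    def __init__(self, n_groups: int) -> None:
        self.pressure = [0] * n_groups   # hypothetical pressure counter per cache group
        self.location = {}               # block address -> group, decoupled from the address

    def record_pressure(self, group: int, amount: int = 1) -> None:
        """Accumulate observed pressure (for example, misses) for one group."""
        self.pressure[group] += amount

    def place(self, block_addr: int) -> int:
        """Place an incoming block in the group with minimum pressure (lowest index on ties)."""
        group = min(range(len(self.pressure)), key=self.pressure.__getitem__)
        self.location[block_addr] = group
        return group


if __name__ == "__main__":
    placer = PressureAwarePlacer(n_groups=4)
    for group, amount in [(0, 5), (1, 2), (2, 7), (3, 2)]:
        placer.record_pressure(group, amount)
    print(placer.place(0x1000))
```

Because the chosen group no longer follows from the block address, the sketch keeps an explicit `location` map; a hardware design would need an equivalent lookup mechanism, which is not modeled here.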