1,661 research outputs found

    Efficient resources assignment schemes for clustered multithreaded processors

    Get PDF
    New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version

    Clustered VLIW architecture based on queue register files

    Get PDF
    Institute for Computing Systems ArchitectureInstruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. IN spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons we understand that the benefits of having parallel FUs, which have motivated the investigation of alternative machine designs. This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture

    Design of Clustered Superscalar Microarchitectures

    Get PDF
    L'objectiu d'aquesta tesi és proposar noves tècniques per al disseny de microarquitectures clúster superescalars eficients. Les microarquitectures clúster particionen el disseny de diversos components crítics del hardware com a mitjà per mantenir-ne el paral·lelisme i millorar-ne l'escalabilitat. El nucli d'un processador clúster, format per blocs de baixa complexitat o clústers, pot executar cadenes d'instruccions dependents sense pagar el sobrecost d'una llarga emissió, curtcircuïts, o lectura de registres, encara que si dues instruccions dependents s'executen en clústers diferents, es paga la penalització d'una comunicació. Per altra banda, les estructures distribuïdes impliquen generalment menors requisits de potència dinàmica, i simplifiquen la gestió de l'energia per mitjà de tècniques com la desactivació selectiva del rellotge o l'energia, o com la reducció a escalade la tensió.El primer objecte d'aquesta recerca és l'assignació d'instruccions a clústers, ja que aquesta juga un paper clau en el rendiment, amb l'objectiu de mantenir equilibrada la càrrega i reduir la penalització de les comunicacions crítiques. Es proposen dos diferents enfocs: primer, una família de nous esquemes que identifiquen dinàmicament certs grups d'instruccions dependents anomenats "slices", i fan l'assignació de clústers slice per slice. Es diferencien d'altres enfocs previs, ja sigui perquè són dinàmics i/o bé perquè inclouen nous mecanismes explícits de mesura i gestió de l'equilibri de càrrega. Segon, una família de nous esquemes que assignen clústers instrucció per instrucció, basats en les assignacions prèvies dels productors dels registres fonts, en la ubicació dels registres físics, i en la càrrega de treball.La segona contribució proposa la predicció de valors com a mitjà per mitigar les penalitzacions dels retards dels connectors i, en particular, per amagar les comunicacions entre clústers. Es demostra que el benefici obtingut amb l'eliminació de dependències creix amb el nombre de clústers i la latència de les comunicacions i, doncs, és major que per a una arquitectura centralitzada. Es proposa un nou esquema d'assignació de clústers que aprofita la menor densitat del graf de dependències per tal de millorar l'equilibri de càrrega.El tercer aspecte considerat es la xarxa d'interconnexió entre clústers, ja que determina la latència de les comunicacions, amb l'objectiu de trobar el millor compromís entre cost i rendiment. Es proposen diverses xarxes punt-a-punt, tant síncrones com parcialment asíncrones, que assoleixen un IPC pròxim al d'un model ideal amb ample de banda il·limitat, tot i tenir molt baixa complexitat. Llur impacte sobre els curtcircuïts, cues d'emissió o bancs de registres es molt menor que el d'altres enfocs. Es proposen també possibles implementacions dels enrutadors, que il·lustren llur factibilitat amb solucions hardware molt simples i de baixa latència. Es proposa un nou esquema d'assignació de clústers conscient de la topologia, que redueix la latència de les comunicacions.L'última contribució proposa tècniques per distribuir els components principals de les etapes inicials del processador, amb l'objectiu de reduir-ne la complexitat i evitar-ne la replicació. Es proposen tècniques eficaces per a la partició del predictor de salts i la lògica de distribució d'instruccions, a fi de minimitzar la penalització pels retards dels connectors causada per les dependències recursives en dos llaços crítics del hardware: la generació de l'adreça de búsqueda d'instruccions i la lògica d'assignació d'instruccions, respectivament. En el primer cas, es converteixen els retards dels connectors intra-estructurals d'un predictor centralitzat en retards de comunicació entre clústers, els quals se segmenten sense problemes. En el segon cas, el particionat de la lògica d'assignació d'instruccions basada en dependències implica paral·lelitzar aquesta tasca, la qual es inherentment seqüencial.El objetivo de esta tesis es proponer técnicas para el diseño de microarquitecturas clúster superescalares eficientes. Las microarquitecturas clúster particionan el diseño de diversos componentes críticos del hardware como medio para mantener el paralelismo y mejorar la escalabilidad. El núcleo de un procesador clúster, formado por bloques de baja complejidad o clústers, puede ejectutar cadenas de instrucciones dependientes sin pagar el sobrecoste de una larga emisión, cortocircuitos, o lectura de registros; pero si dos instrucciones dependientes se ejecutan en clústers distintos, se paga la penalización de una comunicación. Por otro lado, las estructuras distribuidas implican generalmente menores requisitos de potencia dinámica, y simplifican la gestión de la energía por medio de técnicas como la desactivación selectiva del reloj o de la alimentación, o la reducción a escala del voltaje.El primer objetivo de esta investigación es la asignación dinámica de instrucciones a clústers, ya que ésta juega un papel clave en el rendimiento, a fin de mantener equilibrada la carga y reducir la penalización de las comunicaciones críticas. Se proponen dos enfoques distintos: primero, una familia de nuevos esquemas que identifican dinámicamente ciertos grupos de instrucciones denominados "slices", y realizan la asignación slice por slice. Éstos se diferencian de otros enfoques previos, ya sea porque son dinámicos y/o porque incluyen nuevos mecanismos explícitos de medida y gestión del equilibrio de carga. Segundo, una familia de nuevos esquemas que asignan clústers instrucción a instrucción, basándose en las asignaciones previas de los productores de sus registros fuente, en la ubicación de los registros físicos, y en la carga de trabajo.La segunda contribución propone la predicción de valores como medio para mitigar las penalizaciones de los retardos de los conectores, y en particular, para esconder las comunicaciones entre clústers. Se demuestra que el beneficio obtenido con la eliminación de dependencias aumenta con el número de clústers y con la latencia de las comunicaciones, y es asimismo mayor que para una arquitectura centralizada. Se propone un nuevo esquema de asignación de clústers que aprovecha la menor densidad del grafo de dependencias con el fin de mejorar el equilibrio de la carga.El tercer aspecto considerado es la red de interconexión entre clústers, pues determina la latencia de las comunicaciones, a fin de hallar el mejor compromiso entre coste y rendimiento. Se proponen diversas redes punto a punto, tanto síncronas como parcialmente asíncronas, que aun teniendo muy baja complejidad consiguen un IPC próximo al de un modelo con ancho de banda ilimitado. Su impacto sobre la complejidad de los cortocircuitos, colas de emisión o bancos de registros es mucho menor que el de otros enfoques. Se proponen también posibles implementaciones de los enrutadores, ilustrando su factibilidad como soluciones simples y de baja latencia. Se propone un esquema de asignación de clústers consciente de la topología, que reduce la latencia de las comunicaciones.La última contribución propone técnicas para distribuir los componentes principales de las etapas iniciales del procesador, con el objetivo de reducir su complejidad y evitar su replicación. Se proponen técnicas eficaces para particionar el predictor de saltos y la lógica de distribución de instrucciones, a fin de minimizar la penalización por retardos de conectores causada por las dependencias recursivas en dos bucles críticos del hardware: la generación de la dirección de búsqueda de instrucciones y la lógica de asignación de clústers. En el primer caso, los retardos de los conectores intra-estructurales de un predictor centralizado se convierten en retardos de comunicación entre clústers, que se pueden segmentar fácilmente. En el segundo caso, el particionado de la lógica de asignación de clústers basada en dependencias implica paralelizar esta tarea, intrínsecamente secuencial.The objective of this thesis is to propose new techniques to design efficient clustered superscalar microarchitectures. Clustered microarchitectures partition the layout of several critical hardware components as a means to keep most of the parallelism while improving the scalability. A clustered processor core, made up of several low complex blocks or clusters, can efficiently execute chains of dependent instructions without paying the overheads of a long issue, register read or bypass latencies. Of course, when two dependent instructions execute in different clusters, an inter-cluster communication penalty is incurred. Moreover, distributed structures usually imply lower dynamic power requirements, and simplify power management via techniques such a selective clock/power gating and voltage scaling.The first target of this research is the assignment of instructions to clusters, since it plays a major role on performance, with the goals of keeping the workload of clusters balanced and reducing the penalty of critical communications. Two different approaches are proposed: first, a family of new schemes that dynamically identify groups of data-dependent instructions called slices, and make cluster assignments on a per-slice basis. The proposed schemes differ from previous approaches either because they are dynamic and/or because they include new mechanisms to deal explicitly with workload balance information gathered at runtime. Second, it proposes a family of new dynamic schemes that assign instructions to clusters in a per-instruction basis, based on prior assignment of the source register producers, on the cluster location of the source physical registers, and on the workload of clusters.The second contribution proposes value prediction as a means to mitigate the penalties of wire delays and, in particular, to hide inter-cluster communications while also improving workload balance. First, it is proven that the benefit of breaking dependences with value prediction grows with the number of clusters and the communication latency, thus it is higher than for a centralized architecture. Second, it is proposed a cluster assignment scheme that exploits the less dense data dependence graph that results from predicting values to achieve a better workload balance.The third aspect considered is the cluster interconnect, which mainly determines communication latency, seeking for the best trade-off between cost and performance. First, several cost-effective point-to-point interconnects are proposed, both synchronous and partially asynchronous, that approach the IPC of an ideal model with unlimited bandwidth while keeping the complexity low. The proposed interconnects have much lower impact than other approaches on the complexity of bypasses, issue queues and register files. Second, possible router implementations are proposed, which illustrate their feasibility with very simple and low-latency hardware solutions. Third, a new topology-aware improvement to the cluster assignment scheme is proposed to reduce the distance (and latency) of inter-cluster communications.The last contribution proposes techniques for distributing the main components of the processor front-end with the goals of reducing their complexity and avoiding replication. In particular, effective techniques are proposed to cluster the branch predictor and the steering logic, that minimize the wire delay penalties caused by broadcasting recursive dependences in two critical hardware loops: the fetch address generation, and the cluster assignment logic, respectively. In the former case, the proposed technique converts the cross-structure wire delays of a centralized predictor into cross-cluster communication delays, which are smoothly pipelined. In the latter case, the partitioning of the instruction steering logic involves the parallelization of an inherently sequential task such as the dependence based cluster assignment of instructions

    Hierarchical clustered register file organization for VLIW processors

    Get PDF
    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version

    Hierarchical Agent-based Adaptation for Self-Aware Embedded Computing Systems

    Get PDF
    Siirretty Doriast

    Dynamically managing the communication-parallelism trade-off in future clustered processors

    Get PDF
    Journal ArticleClustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communication-bound processors of the future

    RELEASE: A High-level Paradigm for Reliable Large-scale Server Software

    Get PDF
    Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project, and describes the progress in the first six months. The project aim is to scale the Erlang’s radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state of the art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene
    • …
    corecore