6,877 research outputs found

    HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors

    Get PDF
    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit for modern processors. We have implemented a co-designed virtual machine monitor that binary translates x86 instructions into RISC like micro-ops. Moreover, the translations are stored as superblocks, which are a trace of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exists in-order to take corrective action in case of misspeculations. During the course of this PhD we have made following contributions. Firstly, we have provided a novel Programmable Functional unit, in-order to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of the macro-op are brought from the Physical Register File to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime. The fusion algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to the macro-ops are stored as control signals in a configuration. The macro-op consists of a configuration ID which helps in locating the configurations. A small configuration cache is present inside the Programmable Functional unit, that holds these configurations. In case of a miss in the configuration cache configurations are loaded from I-Cache. Moreover, in-order to support bulk commit of atomic superblocks that are larger than the ROB we have proposed a speculative commit mechanism. For this we have proposed a Speculative commit register map table that holds the mappings of the speculatively committed instructions. When all the instructions of the superblock have committed the speculative state is copied to Backend Register Rename Table. Secondly, we proposed a co-designed in-order processor with with two kinds of accelerators. These FU based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion. First, we fused a pair of independent loads together into vector loads and execute them on vector load units. For the second kind of instruction fusion we have fused a pair of dependent simple ALU instructions and execute them in Interlock Collapsing ALUs (ICALU). Moreover, we have evaluated performance of various code optimizations such as list-scheduling, load-store telescoping and load hoisting among others. We have compared our co-designed processor with small instruction window out-of-order processors. Thirdly, we have proposed a co-designed out-of-order processor. Specifically we have reduced complexity in two areas. First of all, we have co-designed the commit mechanism, that enable bulk commit of atomic superblocks. In this solution we got rid of the conventional ROB, instead we introduce the Superblock Ordering Buffer (SOB). SOB ensures program order is maintained at the granularity of the superblock, by bulk committing the program state. The program state consists of the register state and the memory state. The register state is held in a per superblock register map table, whereas the memory state is held in gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of Out-of-Order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. Moreover, a mechanism to release the FIFO entries earlier is also proposed that further improves the performance of the steering heuristic.En aquesta tesis hem explorat el paradigma de les màquines issue i commit per processadors actuals. Hem implementat una màquina virtual que tradueix binaris x86 a micro-ops de tipus RISC. Aquestes traduccions es guarden com a superblocks, que en realitat no és més que una traça de virtuals co-dissenyades. En particular, hem proposat mecanismes hw/sw per a la fusió d’instruccions, blocs bàsics. Aquests superblocks s’optimitzen utilitzant optimizacions especualtives i d’altres no speculatives. En cas de les optimizations especulatives es consideren mecanismes per a la gestió de errades en l’especulació. Al llarg d’aquesta tesis s’han fet les següents contribucions: Primer, hem proposat una nova unitat functional programmable (PFU) per tal de millorar l’execució d’aplicacions de proposit general. La PFU està formada per un conjunt d’unitats funcionals, similar al CCA, amb un banc de registres intern a la PFU distribuït a les unitats funcionals que la composen. Les entrades de la macro-operació que s’executa en la PFU es mouen del banc de registres físic convencional al intern fent servir un conjunt de moves i loads. Un algorisme de fusió combina més micro-operacions en temps d’execució. Aquest algorisme es basa en un pas de planificació que mesura el benefici de les decisions de fusió. Les micro operacions corresponents a la macro operació s’emmagatzemen com a senyals de control en una configuració. Les macro-operacions tenen associat un identificador de configuració que ajuda a localitzar d’aquestes. Una petita cache de configuracions està present dintre de la PFU per tal de guardar-les. En cas de que la configuració no estigui a la cache, les configuracions es carreguen de la cache d’instruccions. Per altre banda, per tal de donar support al commit atòmic dels superblocks que sobrepassen el tamany del ROB s’ha proposat un mecanisme de commit especulatiu. Per aquest mecanisme hem proposat una taula de mapeig especulativa dels registres, que es copia a la taula no especulativa quan totes les instruccions del superblock han comitejat. Segon, hem proposat un processador en order co-dissenyat que combina dos tipus d’acceleradors. Aquests acceleradors executen un parell d’instruccions fusionades. S’han considerat dos tipus de fusió d’instructions. Primer, combinem un parell de loads independents formant loads vectorials i els executem en una unitat vectorial. Segon, fusionem parells d’instruccions simples d’alu que són dependents i que s’executaran en una Interlock Collapsing ALU (ICALU). Per altra aquestes tecniques les hem evaluat conjuntament amb diverses optimizacions com list scheduling, load-store telescoping i hoisting de loads, entre d’altres. Aquesta proposta ha estat comparada amb un processador fora d’ordre. Tercer, hem proposat un processador fora d’ordre co-dissenyat efficient reduint-ne la complexitat en dos areas principals. En primer lloc, hem co-disenyat el mecanisme de commit per tal de permetre un eficient commit atòmic del superblocks. En aquesta solució hem substituït el ROB convencional, i en lloc hem introduït el Superblock Ordering Buffer (SOB). El SOB manté l’odre de programa a granularitat de superblock. L’estat del programa consisteix en registres i memòria. L’estat dels registres es manté en una taula per superblock, mentre que l’estat de memòria es guarda en un buffer i s’actulitza atòmicament. La segona gran area de reducció de complexitat considerarada és l’ús de FIFOs a la lògica d’issue. En aquest últim àmbit hem proposat una heurística de distribució que solventa les ineficiències de l’heurística basada en dependències anteriorment proposada. Finalment, i junt amb les FIFOs, s’ha proposat un mecanisme per alliberar les entrades de la FIFO anticipadament

    Dynamic Dependency Collapsing

    Get PDF
    In this dissertation, we explore the concept of dynamic dependency collapsing. Performance increases in computer architecture are always introduced by exploiting additional parallelism when the clock speed is fixed. We show that further improvements are possible even when the available parallelism in programs are exhausted. This performance improvement is possible due to executing instructions in parallel that would ordinarily have been serialized. We call this concept dependency collapsing. We explore existing techniques that exploit parallelism and show which of them fall under the umbrella of dependency collapsing. We then introduce two dependency collapsing techniques of our own. The first technique collapses data dependencies by executing two normally dependent instructions together by fusing them. We show that exploiting the additional parallelism generated by collapsing these dependencies results in a performance increase. Our second technique collapses resource dependencies to execute instructions that would normally have been serialized due to resource constraints in the processor. We show that it is possible to take advantage of larger in-processor structures while avoiding the power and area penalty this often implies

    OpenISA, um conjunto de instruções híbrido

    Get PDF
    Orientador: Edson BorinTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: OpenISA é concebido como a interface de processadores que pretendem ser altamente flexíveis. Isto é conseguido por meio de três estratégias: em primeiro lugar, o ISA é empiricamente escolhido para ser facilmente traduzido para outros, possibilitando flexibilidade do software no caso de um processador OpenISA físico não estar disponível. Neste caso, não há nenhuma necessidade de aplicar um processador virtual OpenISA em software. O ISA está preparado para ser estaticamente traduzido para outros ISAs. Segundo, o ISA não é um ISA concreto nem um ISA virtual, mas um híbrido com a capacidade de admitir modificações nos opcodes sem afetar a compatibilidade retroativa. Este mecanismo permite que as futuras versões do ISA possam sofrer modificações em vez de extensões simples das versões anteriores, um problema comum com ISA concretos, como o x86. Em terceiro lugar, a utilização de uma licença permissiva permite o ISA ser usado livremente por qualquer parte interessada no projeto. Nesta tese de doutorado, concentramo-nos nas instruções de nível de usuário do OpenISA. A tese discute (1) alternativas para ISAs, alternativas para distribuição de programas e o impacto de cada opção, (2) características importantes de OpenISA para atingir seus objetivos e (3) fornece uma completa avaliação do ISA escolhido com respeito a emulação de desempenho em duas CPUs populares, uma projetada pela Intel e outra pela ARM. Concluímos que a versão do OpenISA apresentada aqui pode preservar desempenho próximo do nativo quando traduzida para outros hospedeiros, funcionando como um modelo promissor para ISAs flexíveis da próxima geração que podem ser facilmente estendidos preservando a compatibilidade. Ainda, também mostramos como isso pode ser usado como um formato de distribuição de programas no nível de usuárioAbstract: OpenISA is designed as the interface of processors that aim to be highly flexible. This is achieved by means of three strategies: first, the ISA is empirically chosen to be easily translated to others, providing software flexibility in case a physical OpenISA processor is not available. Second, the ISA is not a concrete ISA nor a virtual ISA, but a hybrid one with the capability of admitting modifications to opcodes without impacting backwards compatibility. This mechanism allows future versions of the ISA to have real changes instead of simple extensions of previous versions, a common problem with concrete ISAs such as the x86. Third, the use of a permissive license allows the ISA to be freely used by any party interested in the project. In this PhD. thesis, we focus on the user-level instructions of OpenISA. The thesis discusses (1) ISA alternatives, program distribution alternatives and the impact of each choice, (2) important features of OpenISA to achieve its goals and (3) provides a thorough evaluation of the chosen ISA with respect to emulation performance on two popular host CPUs, one from Intel and another from ARM. We conclude that the version of OpenISA presented here can preserve close-to-native performance when translated to other hosts, working as a promising model for next-generation, flexible ISAs that can be easily extended while preserving backwards compatibility. Furthermore, we show how this can also be a program distribution format at user-levelDoutoradoCiência da ComputaçãoDoutor em Ciência da Computação2011/09630-1FAPES

    Simple, Parallel, High-Performance Virtual Machines for Extreme Computations

    Full text link
    We introduce a high-performance virtual machine (VM) written in a numerically fast language like Fortran or C to evaluate very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte code from the optimal matrix element generator, O'Mega. Furthermore, this approach allows to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze hereby the scaling behaviour with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and has in general runtimes in the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.Comment: 19 pages, 8 figure
    • …
    corecore