25 research outputs found

    HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors

    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit for modern processors. We have implemented a co-designed virtual machine monitor that binary-translates x86 instructions into RISC-like micro-ops. The translations are stored as superblocks, which are traces of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exist to take corrective action in case of misspeculation. During the course of this PhD we have made the following contributions.
Firstly, we have provided a novel Programmable Functional Unit (PFU) to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of a macro-op are brought from the Physical Register File to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime. The fusion algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to a macro-op are stored as control signals in a configuration. Each macro-op carries a configuration ID that is used to locate its configuration. A small configuration cache inside the PFU holds these configurations. On a miss in the configuration cache, configurations are loaded from the I-Cache. Moreover, in order to support bulk commit of atomic superblocks that are larger than the ROB, we have proposed a speculative commit mechanism. For this we have proposed a Speculative Commit Register Map Table that holds the mappings of the speculatively committed instructions. When all the instructions of the superblock have committed, the speculative state is copied to the Backend Register Rename Table.
Secondly, we have proposed a co-designed in-order processor with two kinds of accelerators. These FU-based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion. First, we fuse a pair of independent loads into a vector load and execute it on vector load units. Second, we fuse a pair of dependent simple ALU instructions and execute them in Interlock Collapsing ALUs (ICALU). Moreover, we have evaluated the performance of various code optimizations such as list scheduling, load-store telescoping and load hoisting, among others. We have compared our co-designed processor with small-instruction-window out-of-order processors.
Thirdly, we have proposed a co-designed out-of-order processor. Specifically, we have reduced complexity in two areas. First, we have co-designed the commit mechanism to enable bulk commit of atomic superblocks. In this solution we eliminate the conventional ROB and instead introduce the Superblock Ordering Buffer (SOB). The SOB ensures that program order is maintained at the granularity of the superblock by bulk-committing the program state. The program state consists of the register state and the memory state. The register state is held in a per-superblock register map table, whereas the memory state is held in a gated store buffer and updated in bulk. Second, we have tackled the complexity of the out-of-order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. A mechanism to release FIFO entries earlier is also proposed, which further improves the performance of the steering heuristic.

    Design methodologies for instruction-set extensible processors

    Ph.D. (Doctor of Philosophy)

    Defining interfaces between hardware and software: Quality and performance

    One of the most important interfaces in a computer system is the interface between hardware and software. This interface is the contract between the hardware designer and the programmer that defines the functional behaviour of the hardware. This thesis examines two critical aspects of defining the hardware-software interface: quality and performance. The first aspect is creating a high quality specification of the interface as conventionally defined in an instruction set architecture. The majority of this thesis is concerned with creating a specification that covers the full scope of the interface; that is applicable to all current implementations of the architecture; and that can be trusted to accurately describe the behaviour of implementations of the architecture. We describe the development of a formal specification of the two major types of Arm processors: A-class (for mobile devices such as phones and tablets) and M-class (for micro-controllers). These specifications are unparalleled in their scope, applicability and trustworthiness. This thesis identifies and illustrates what we consider the key ingredient in achieving this goal: creating a specification that is used by many different user groups. Supporting many different groups leads to improved quality as each group finds different problems in the specification; and, by providing value to each different group, it helps justify the considerable effort required to create a high quality specification of a major processor architecture. 
The work described in this thesis led to a step change in Arm's ability to use formal verification techniques to detect errors in their processors; enabled extensive testing of the specification against Arm's official architecture conformance suite; improved the quality of Arm's architecture conformance suite based on measuring the architectural coverage of the tests; supported earlier, faster development of architecture extensions by enabling animation of changes as they are being made; and enabled early detection of problems created from architecture extensions by performing formal validation of the specification against semi-structured natural language specifications. As far as we are aware, no other mainstream processor architecture has this capability. The formal specifications are included in Arm's publicly released architecture reference manuals and the A-class specification is also released in machine-readable form. The second aspect is creating a high-performance interface by defining the hardware-software interface of a software-defined radio subsystem using a programming language. That is, an interface that allows software to exploit the potential performance of the underlying hardware. While the hardware-software interface is normally defined in terms of machine code, peripheral control registers and memory maps, we define it using a programming language instead. This higher-level interface provides the opportunity for compilers to hide some of the low-level differences between different systems from the programmer: a potentially very efficient way of providing a stable, portable interface without having to add hardware to provide portability between different hardware platforms. We describe the design and implementation of a set of extensions to the C programming language to support programming high-performance, energy-efficient, software-defined radio systems. 
The language extensions enable the programmer to exploit the pipeline parallelism typically present in digital signal processing applications and to make efficient use of the asymmetric multiprocessor systems designed to support such applications. The extensions consist primarily of annotations that can be checked for consistency and that support annotation inference in order to reduce the number of annotations required. Reducing the number of annotations does not just save programmer effort; it also improves portability by reducing the number of annotations that need to be changed when porting an application from one platform to another. This work formed part of a project that developed a high-performance, energy-efficient, software-defined radio capable of implementing the physical layers of the 4G cellphone standard (LTE), 802.11a WiFi and Digital Video Broadcast (DVB) with a power and silicon area budget that was competitive with a conventional custom ASIC solution. The Arm architecture is the largest computer architecture by volume in the world. It behooves us to ensure that the interface it describes is appropriately defined.