7 research outputs found

    Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment

    Optimizing SIMD execution in HW/SW co-designed processors

    SIMD accelerators are ubiquitous in microprocessors across computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness; as a result, they miss significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to compile-time static vectorization. Several environments provide the runtime profiling and optimization support required for dynamic vectorization, the most prominent being: 1) Dynamic Binary Translators and Optimizers (DBTOs) and 2) Hardware/Software (HW/SW) co-designed processors. A HW/SW co-designed environment provides several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. Therefore, we use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions, we propose Selective Writing, which selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume a significant amount of chip real estate, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units; however, it has its own energy and performance overheads. We propose to selectively devectorize the vector code when the higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for the maximum duration, reducing overall leakage energy.
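    The coverage benefit of iterating over multiple vector lengths can be sketched in a few lines. This is a minimal sketch in Python; the pack widths and the greedy widest-first packing below are illustrative assumptions, not the thesis's actual Variable Length Vectorization algorithm:

```python
def pack_variable_length(n_ops, widths=(8, 4, 2)):
    """Greedily pack n_ops independent scalar operations into vector
    packs, trying the widest vector length first and falling back to
    narrower ones.  Returns (packs, scalar_leftover)."""
    packs = []
    remaining = n_ops
    for w in widths:                      # widest vector length first
        while remaining >= w:
            packs.append(w)               # one vector op covers w scalar ops
            remaining -= w
    return packs, remaining               # leftover ops stay scalar

def coverage(n_ops, widths):
    """Fraction of scalar operations covered by vector instructions."""
    _, leftover = pack_variable_length(n_ops, widths)
    return (n_ops - leftover) / n_ops

# Fixed 8-wide vectorization covers only 8 of 15 operations, while
# also allowing 4- and 2-wide packs raises coverage to 14 of 15.
print(coverage(15, (8,)))
print(coverage(15, (8, 4, 2)))
```

    The point of the sketch is the trend the abstract describes: a single fixed (wide) vector length strands any group of operations smaller than that width in scalar form, whereas falling back to narrower vector lengths recovers most of that lost coverage.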

    HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors

    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue, and commit in modern processors. We have implemented a co-designed virtual machine monitor that binary-translates x86 instructions into RISC-like micro-ops. The translations are stored as superblocks, which are traces of basic blocks, and these superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exist to take corrective action in case of misspeculation. During the course of this PhD we have made the following contributions. Firstly, we have provided a novel Programmable Functional Unit (PFU) to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to the CCA, and a distributed internal register file. The inputs of a macro-op are brought from the physical register file to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime; it is based on a scheduling step that indicates whether the current fused instruction is beneficial. The micro-ops corresponding to a macro-op are stored as control signals in a configuration, and each macro-op carries a configuration ID that helps locate its configuration. A small configuration cache inside the PFU holds these configurations; on a miss in the configuration cache, configurations are loaded from the I-cache. Moreover, to support bulk commit of atomic superblocks that are larger than the ROB, we have proposed a speculative commit mechanism, introducing a speculative commit register map table that holds the mappings of the speculatively committed instructions. When all the instructions of a superblock have committed, the speculative state is copied to the backend register rename table. Secondly, we proposed a co-designed in-order processor with two kinds of FU-based accelerators, each running a pair of fused instructions. We have considered two kinds of instruction fusion: first, we fuse pairs of independent loads into vector loads and execute them on vector load units; second, we fuse pairs of dependent simple ALU instructions and execute them on Interlock Collapsing ALUs (ICALUs). Moreover, we have evaluated the performance of various code optimizations such as list scheduling, load-store telescoping, and load hoisting, among others, and compared our co-designed processor with small-instruction-window out-of-order processors. Thirdly, we have proposed a co-designed out-of-order processor in which we reduce complexity in two areas. First, we have co-designed the commit mechanism to enable bulk commit of atomic superblocks. In this solution we get rid of the conventional ROB and instead introduce the Superblock Ordering Buffer (SOB), which ensures that program order is maintained at the granularity of the superblock by bulk-committing the program state. The program state consists of the register state and the memory state: the register state is held in a per-superblock register map table, whereas the memory state is held in a gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of out-of-order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic, and a mechanism that releases FIFO entries earlier, further improving the performance of the steering heuristic.
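    A minimal sketch of the dependent-pair fusion idea (the kind of pair an interlock-collapsing ALU would execute) might look as follows. The micro-op representation and the greedy single-pass policy are assumptions for illustration, not the thesis's actual fusion algorithm:

```python
from collections import namedtuple

# Hypothetical micro-op encoding: opcode, destination register, sources.
MicroOp = namedtuple("MicroOp", "op dst srcs")

def fuse_dependent_pairs(uops, fusable=("add", "sub", "and", "or")):
    """Greedy single-pass fusion: a simple ALU micro-op whose result
    is consumed by the immediately following simple ALU micro-op is
    merged into one macro-op (the dependent pair an ICALU collapses)."""
    out, i = [], 0
    while i < len(uops):
        a = uops[i]
        if (i + 1 < len(uops)
                and a.op in fusable
                and uops[i + 1].op in fusable
                and a.dst in uops[i + 1].srcs):
            out.append(("macro", a, uops[i + 1]))   # fused dependent pair
            i += 2
        else:
            out.append(("single", a))               # left unfused
            i += 1
    return out

trace = [MicroOp("add", "r1", ("r2", "r3")),
         MicroOp("sub", "r4", ("r1", "r5")),        # consumes r1 -> fuses
         MicroOp("load", "r6", ("r4",))]            # not a simple ALU op
print([kind for kind, *_ in fuse_dependent_pairs(trace)])
```

    A real implementation would also check operand-count limits and, as the abstract notes, run a scheduling step to decide whether each candidate fusion is actually beneficial; the sketch only shows the dependence test that makes a pair collapsible.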

    Performance simulation methodologies for hardware/software co-designed processors

    Recently the community has started looking into Hardware/Software (HW/SW) co-designed processors as a potential path towards less power-hungry and less complex designs. Unlike other solutions, they reduce power and complexity by performing so-called dynamic binary translation and optimization from a guest ISA to an internal custom host ISA. This thesis addresses the question of how to simulate this kind of architecture. For any processor architecture, simulation is the common practice, because it is impossible to build several versions of the hardware in order to try all alternatives. Simulating HW/SW co-designed processors poses a major challenge compared with simulating traditional hardware-only architectures. First of all, no open-source tools exist, so researchers often assume that the overhead of the software layer, which is in charge of dynamic binary translation and optimization, is constant, or they ignore it altogether. In this thesis we show that such an assumption is not valid and can lead to very inaccurate results; including the software layer in the simulation is therefore a must. On the other hand, simulation is very slow compared to native execution, so the community has spent considerable effort on delivering accurate results in a reasonable amount of time. It is therefore common practice for hardware-only processors to simulate only parts of the application stream, called samples. Samples usually correspond to different phases in the application stream and are usually no longer than a few million instructions. To achieve an accurate starting state for each sample, microarchitectural structures are warmed up for a few million instructions prior to the sample's instructions. Unfortunately, such a methodology cannot be directly applied to HW/SW co-designed processors. The warm-up for HW/SW co-designed processors needs to be 3-4 orders of magnitude longer than the warm-up needed for a traditional hardware-only processor, because the warm-up of the software layer takes far longer than the warm-up of the hardware structures. To overcome this problem, in this thesis we propose a novel warm-up technique specialized for HW/SW co-designed processors. Our solution reduces the simulation time by at least 65X with an average error of just 0.75%. This trend holds across different software and hardware configurations. The process used to select simulation samples cannot be directly applied to HW/SW co-designed processors either, because, due to the software layer, samples show more dissimilarity than in the case of hardware-only processors. We therefore propose a novel algorithm that needs 3X fewer samples to achieve an error similar to that of state-of-the-art algorithms. Again, this trend holds across different software and hardware configurations.
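    Why sampled simulation needs warm-up at all can be illustrated with a toy direct-mapped cache. The cache geometry, the trace, and the warm-up lengths below are illustrative only; the thesis's point is that for co-designed processors the software layer's state needs orders of magnitude more warm-up than such hardware structures:

```python
class DirectMappedCache:
    """Toy direct-mapped cache used to show why a simulation sample
    started from a cold microarchitectural state gives wrong numbers."""
    def __init__(self, n_sets=64, line=64):
        self.tags = [None] * n_sets
        self.n_sets, self.line = n_sets, line
        self.hits = self.misses = 0

    def access(self, addr, record=True):
        s = (addr // self.line) % self.n_sets
        tag = addr // (self.line * self.n_sets)
        hit = self.tags[s] == tag
        self.tags[s] = tag
        if record:                        # warm-up accesses are not counted
            self.hits += hit
            self.misses += not hit

def miss_rate(trace, sample, warmup=0):
    """Simulate only the last `warmup + sample` accesses of `trace`:
    warm-up accesses update cache state silently, then statistics are
    collected for the sample -- i.e. sampling with functional warm-up."""
    cache = DirectMappedCache()
    start = len(trace) - sample
    for i, addr in enumerate(trace[start - warmup:]):
        cache.access(addr, record=(i >= warmup))
    return cache.misses / sample

# A loop repeatedly touching a 32-line working set: in steady state
# everything hits, so a cold-started sample wildly overstates misses.
trace = [i * 64 for i in range(32)] * 100
print(miss_rate(trace, sample=32, warmup=0))    # cold-start sample
print(miss_rate(trace, sample=32, warmup=64))   # warmed-up sample
```

    The cold sample reports a 100% miss rate for a loop that in reality hits on every access after its first iteration; two warm-up passes are enough here because the simulated state is tiny. The thesis's observation is that a co-designed processor's software layer (translation caches, optimized superblocks) is vastly larger state, hence the 3-4 orders of magnitude longer warm-up.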

    Software instruction caching

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 185-193). As microprocessor complexities and costs skyrocket, designers are looking for ways to simplify their designs to reduce costs, improve energy efficiency, or squeeze more computational elements on each chip. This is particularly true for the embedded domain, where cost and energy consumption are paramount. Software instruction caches have the potential to provide the required performance while using simpler, more efficient hardware. A software cache consists of a simple array memory (such as a scratchpad) and a software system that is capable of automatically managing that memory as a cache. Software caches have several advantages over traditional hardware caches. Without complex cache-management logic, the processor hardware is cheaper and easier to design, verify, and manufacture. The reduced access energy of simple memories can result in a net energy savings if management overhead is kept low. Software caches can also be customized to each individual program's needs, improving performance or eliminating unpredictable timing for real-time embedded applications. The greatest challenge for a software cache is providing good performance using general-purpose instructions for cache management rather than specially-designed hardware. This thesis designs and implements a working system (Flexicache) on an actual embedded processor and uses it to investigate the strengths and weaknesses of software instruction caches. Although both data and instruction caches can be implemented in software, very different techniques are used to optimize performance; this work focuses exclusively on software instruction caches. The Flexicache system consists of two software components: a static off-line preprocessor to add caching to an application and a dynamic runtime system to manage memory during execution. Key interfaces and optimizations are identified and characterized. The system is evaluated in detail from the standpoints of both performance and energy consumption. The results indicate that software instruction caches can perform comparably to hardware caches in embedded processors. On most benchmarks, the overhead relative to a hardware cache is less than 12% and can be as low as 2.4%. At the same time, the software cache uses up to 6% less energy. This is achieved using a simple, directly-addressed memory and without requiring any complex, specialized hardware structures. by Jason Eric Miller. Ph.D.
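    The miss-handling loop at the heart of a software instruction cache can be sketched as follows. This is a minimal sketch: the FIFO replacement policy, the capacity, and the dict-as-scratchpad representation are illustrative assumptions, not Flexicache's actual design:

```python
from collections import OrderedDict

class SoftwareICache:
    """Minimal sketch of a software-managed instruction cache: a small
    scratchpad holds at most `capacity` code blocks, and a runtime
    miss handler copies missing blocks in from backing memory on
    demand, evicting the oldest resident block when full (FIFO)."""
    def __init__(self, backing_store, capacity=4):
        self.backing = backing_store          # block_id -> code bytes
        self.capacity = capacity
        self.scratchpad = OrderedDict()       # blocks resident in fast memory
        self.misses = 0

    def fetch(self, block_id):
        if block_id not in self.scratchpad:   # software tag check
            self.misses += 1                  # runtime miss handler runs
            if len(self.scratchpad) >= self.capacity:
                self.scratchpad.popitem(last=False)   # evict oldest block
            self.scratchpad[block_id] = self.backing[block_id]
        return self.scratchpad[block_id]

rom = {b: f"code-{b}" for b in range(8)}      # backing instruction memory
icache = SoftwareICache(rom, capacity=4)
for b in [0, 1, 2, 0, 1, 3, 4, 0]:            # block 0 evicted, then refetched
    icache.fetch(b)
print(icache.misses)
```

    In a real system like Flexicache the "tag check" is a few general-purpose instructions inserted by the off-line preprocessor at control transfers, which is exactly the management overhead the thesis works to minimize.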

    Efficient, transparent, and comprehensive runtime code manipulation

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 293-306). This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamically-generated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every instruction--which is critical for program analysis, instrumentation, trace gathering, optimization, and similar tools--it can now only be done at runtime. Existing runtime tools are successful at inserting instrumentation calls, but no general framework has been developed for fine-grained and comprehensive code observation and modification without high overheads. This thesis demonstrates the feasibility of building such a system in software. We present DynamoRIO, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO uses code caching technology to provide efficient, transparent, and comprehensive manipulation of an unmodified application running on a stock operating system and commodity hardware. DynamoRIO executes large, complex, modern applications with dynamically-loaded, generated, or even modified code. Despite the formidable obstacles inherent in the IA-32 architecture, DynamoRIO provides these capabilities efficiently, with zero to thirty percent time and memory overhead on both Windows and Linux. DynamoRIO exports an interface for building custom runtime code manipulation tools of all types. It has been used by many researchers, with several hundred downloads of our public release, and is being commercialized in a product for protection against remote security exploits, one of numerous applications of runtime code manipulation. by Derek L. Bruening. Ph.D.
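    The code-caching idea behind such systems can be sketched as follows. This is a conceptual sketch only: the block representation, the `translate` hook, and the statistics are hypothetical stand-ins, not DynamoRIO's API:

```python
def make_cached_executor(translate):
    """Sketch of runtime code caching: each basic block is translated
    (and instrumented) exactly once, the result is stored in a code
    cache, and every later execution reuses the cached copy instead of
    paying the translation cost again."""
    code_cache = {}
    stats = {"translations": 0, "executions": 0}

    def execute(block_id, block_src):
        if block_id not in code_cache:
            stats["translations"] += 1
            code_cache[block_id] = translate(block_src)  # one-time cost
        stats["executions"] += 1
        return code_cache[block_id]()                    # run from the cache

    return execute, stats

# Illustrative "translation": compile the block source into a callable
# and wrap it with an inserted instrumentation hook, as a tool might.
executed_blocks = []
def translate(src):
    fn = eval(compile(src, "<block>", "eval"))  # hypothetical IR -> callable
    def instrumented():
        executed_blocks.append(src)             # inserted instrumentation
        return fn()
    return instrumented

execute, stats = make_cached_executor(translate)
for _ in range(3):
    execute("b0", "lambda: 2 + 2")              # translated once, run thrice
print(stats)
```

    Amortizing one translation over many executions is what keeps the overhead low; the hard parts the thesis actually addresses (transparency, indirect branches, self-modifying code on IA-32) are precisely what this toy sketch leaves out.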