608 research outputs found

    SAMIE-LSQ: set-associative multiple-instruction entry load/store queue

    Get PDF
    The load/store queue (LSQ) is one of the most complex parts of contemporary processors. Its latency is critical for the processor performance and it is usually one of the processor hotspots. This paper presents a highly banked, set-associative, multiple-instruction entry LSQ (SAMIE-LSQ,) that achieves high performance with small energy requirements. The SAMIE-LSQ classifies the memory instructions (loads and stores) based on the address to be accessed, and groups those instructions accessing the same cache line in the same entry. Our approach relies on the fact that many in-flight memory instructions access the same cache lines. Each SAMIE-LSQ entry has space for several memory instructions accessing the same cache line. This arrangement has a number of advantages. First, it significantly reduces the address comparison activity needed for memory disambiguation since there are less addresses to be compared. It also reduces the activity in the data TLB, the cache tag and cache data arrays. This is achieved by caching the cache line location and address translation in the corresponding SAMIE-LSQ entry once the access of one of the instructions in an entry is performed, so instructions that share an entry can reuse the translation, avoid the tag check and get the data directly from the concrete cache way without checking the others. Besides, the delay of the proposed scheme is lower than that required by a conventional LSQ. We show that the SAMIE-LSQ saves 82% dynamic energy for the load/store queue, 42% for the LI data cache and 73% for the data TLB, with a negligible impact on performance (0.6%)Peer ReviewedPostprint (published version

    Design of a distributed memory unit for clustered microarchitectures

    Get PDF
    Power constraints led to the end of exponential growth in single–processor performance, which characterized the semiconductor industry for many years. Single–chip multiprocessors allowed the performance growth to continue so far. Yet, Amdahl’s law asserts that the overall performance of future single–chip multiprocessors will depend crucially on single–processor performance. In a multiprocessor a small growth in single–processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy–efficiency and ultimately the performance of a single processor. In a clustered microarchitecture parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters components. Because the clusters together process a single instruction stream communications between clusters are necessary and introduce an additional cost. This thesis proposes the design of a distributed memory unit and first level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that are a result of the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add functionality to select instructions for execution and re-execution to the memory unit. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation. This mechanism is a deadlock safe issue policy for the Memory Issue Queues. Deadlocks can result from certain queue allocations because entries are allocated out-of-order instead of in-order like in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering such as Alpha, PowerPC or ARMv7 can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy efficient load queues without the need of energy hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and of improved performance

    HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors

    Get PDF
    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit for modern processors. We have implemented a co-designed virtual machine monitor that binary translates x86 instructions into RISC like micro-ops. Moreover, the translations are stored as superblocks, which are a trace of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exists in-order to take corrective action in case of misspeculations. During the course of this PhD we have made following contributions. Firstly, we have provided a novel Programmable Functional unit, in-order to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of the macro-op are brought from the Physical Register File to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime. The fusion algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to the macro-ops are stored as control signals in a configuration. The macro-op consists of a configuration ID which helps in locating the configurations. A small configuration cache is present inside the Programmable Functional unit, that holds these configurations. In case of a miss in the configuration cache configurations are loaded from I-Cache. Moreover, in-order to support bulk commit of atomic superblocks that are larger than the ROB we have proposed a speculative commit mechanism. For this we have proposed a Speculative commit register map table that holds the mappings of the speculatively committed instructions. When all the instructions of the superblock have committed the speculative state is copied to Backend Register Rename Table. Secondly, we proposed a co-designed in-order processor with with two kinds of accelerators. These FU based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion. First, we fused a pair of independent loads together into vector loads and execute them on vector load units. For the second kind of instruction fusion we have fused a pair of dependent simple ALU instructions and execute them in Interlock Collapsing ALUs (ICALU). Moreover, we have evaluated performance of various code optimizations such as list-scheduling, load-store telescoping and load hoisting among others. We have compared our co-designed processor with small instruction window out-of-order processors. Thirdly, we have proposed a co-designed out-of-order processor. Specifically we have reduced complexity in two areas. First of all, we have co-designed the commit mechanism, that enable bulk commit of atomic superblocks. In this solution we got rid of the conventional ROB, instead we introduce the Superblock Ordering Buffer (SOB). SOB ensures program order is maintained at the granularity of the superblock, by bulk committing the program state. The program state consists of the register state and the memory state. The register state is held in a per superblock register map table, whereas the memory state is held in gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of Out-of-Order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. Moreover, a mechanism to release the FIFO entries earlier is also proposed that further improves the performance of the steering heuristic.En aquesta tesis hem explorat el paradigma de les màquines issue i commit per processadors actuals. Hem implementat una màquina virtual que tradueix binaris x86 a micro-ops de tipus RISC. Aquestes traduccions es guarden com a superblocks, que en realitat no és més que una traça de virtuals co-dissenyades. En particular, hem proposat mecanismes hw/sw per a la fusió d’instruccions, blocs bàsics. Aquests superblocks s’optimitzen utilitzant optimizacions especualtives i d’altres no speculatives. En cas de les optimizations especulatives es consideren mecanismes per a la gestió de errades en l’especulació. Al llarg d’aquesta tesis s’han fet les següents contribucions: Primer, hem proposat una nova unitat functional programmable (PFU) per tal de millorar l’execució d’aplicacions de proposit general. La PFU està formada per un conjunt d’unitats funcionals, similar al CCA, amb un banc de registres intern a la PFU distribuït a les unitats funcionals que la composen. Les entrades de la macro-operació que s’executa en la PFU es mouen del banc de registres físic convencional al intern fent servir un conjunt de moves i loads. Un algorisme de fusió combina més micro-operacions en temps d’execució. Aquest algorisme es basa en un pas de planificació que mesura el benefici de les decisions de fusió. Les micro operacions corresponents a la macro operació s’emmagatzemen com a senyals de control en una configuració. Les macro-operacions tenen associat un identificador de configuració que ajuda a localitzar d’aquestes. Una petita cache de configuracions està present dintre de la PFU per tal de guardar-les. En cas de que la configuració no estigui a la cache, les configuracions es carreguen de la cache d’instruccions. Per altre banda, per tal de donar support al commit atòmic dels superblocks que sobrepassen el tamany del ROB s’ha proposat un mecanisme de commit especulatiu. Per aquest mecanisme hem proposat una taula de mapeig especulativa dels registres, que es copia a la taula no especulativa quan totes les instruccions del superblock han comitejat. Segon, hem proposat un processador en order co-dissenyat que combina dos tipus d’acceleradors. Aquests acceleradors executen un parell d’instruccions fusionades. S’han considerat dos tipus de fusió d’instructions. Primer, combinem un parell de loads independents formant loads vectorials i els executem en una unitat vectorial. Segon, fusionem parells d’instruccions simples d’alu que són dependents i que s’executaran en una Interlock Collapsing ALU (ICALU). Per altra aquestes tecniques les hem evaluat conjuntament amb diverses optimizacions com list scheduling, load-store telescoping i hoisting de loads, entre d’altres. Aquesta proposta ha estat comparada amb un processador fora d’ordre. Tercer, hem proposat un processador fora d’ordre co-dissenyat efficient reduint-ne la complexitat en dos areas principals. En primer lloc, hem co-disenyat el mecanisme de commit per tal de permetre un eficient commit atòmic del superblocks. En aquesta solució hem substituït el ROB convencional, i en lloc hem introduït el Superblock Ordering Buffer (SOB). El SOB manté l’odre de programa a granularitat de superblock. L’estat del programa consisteix en registres i memòria. L’estat dels registres es manté en una taula per superblock, mentre que l’estat de memòria es guarda en un buffer i s’actulitza atòmicament. La segona gran area de reducció de complexitat considerarada és l’ús de FIFOs a la lògica d’issue. En aquest últim àmbit hem proposat una heurística de distribució que solventa les ineficiències de l’heurística basada en dependències anteriorment proposada. Finalment, i junt amb les FIFOs, s’ha proposat un mecanisme per alliberar les entrades de la FIFO anticipadament

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Get PDF
    Processor efficiency can be described with the help of a number of  desirable effects or metrics, for example, performance, power, area, design complexity and access latency. These metrics serve as valuable tools used in designing new processors and they also act as  effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency,  and show how to improve efficiency by exploiting common-case or  predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on in our research are the predictability of memory dependences, the predictability in data forwarding patterns,   predictability in instruction criticality and conservativeness in resource allocation and deallocation policies. We first design a scalable  and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry  hardware in the processor with no loss in performance.  We then move to  studying instruction criticality to improve  processor efficiency. We study the behavior of critical load instructions  and propose applications that can be optimized using  predictable, load-criticality  information. Finally, we explore conventional techniques for allocation and deallocation  of critical structures that process memory instructions and propose new techniques to optimize the same.  Our new designs have the potential to reduce  the power and the area required by processors significantly without losing  performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo

    Instruction scheduling in micronet-based asynchronous ILP processors

    Get PDF

    Optimizing SIMD execution in HW/SW co-designed processors

    Get PDF
    SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector length is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization that iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing that selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume significant amount of real estate on the chip, they become the principle source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce leakage energy of functional units. However, power gating has its own energy and performance overhead associated with it. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for maximum duration. Therefore, resulting in overall leakage energy reduction.Postprint (published version

    DEMAND-DRIVEN EXECUTION USING FUTURE GATED SINGLE ASSIGNMENT FORM

    Get PDF
    This dissertation discusses a novel, previously unexplored execution model called Demand-Driven Execution (DDE), which executes programs starting from the outputs of the program, progressing towards the inputs of the program. This approach is significantly different from prior demand-driven reduction machines as it can execute a program written in an imperative language using the demand-driven paradigm while extracting both instruction and data level parallelism. The execution model relies on an executable Single Assignment Form which serves both as the internal representation of the compiler as well as the Instruction Set Architecture (ISA) of the machine. This work develops the instruction set architecture, the programming language pragmatics, and the microarchitecture for the demand-driven execution paradigm

    Compiler-managed memory system for software-exposed architectures

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000.Includes bibliographical references (p. 155-161).Microprocessors must exploit both instruction-level parallelism (ILP) and memory parallelism for high performance. Sophisticated techniques for ILP have boosted the ability of modern-day microprocessors to exploit ILP when available. Unfortunately, improvements in memory parallelism in microprocessors have lagged behind. This thesis explains why memory parallelism is hard to exploit in microprocessors and advocate bank-exposed architectures as an effective way to exploit more memory parallelism. Bank exposed architectures are a kind of software-exposed architecture: one in which the low level details of the hardware are visible to the software. In a bank-exposed architecture, the memory banks are visible to the software, enabling the compiler to exploit a high degree of memory parallelism in addition to ILP. Bank-exposed architectures can be employed by general-purpose processors, and by embedded chips, such as those used for digital-signal processing. This thesis presents Maps, an enabling compiler technology for bank-exposed architectures. Maps solves the problem of bank-disambiguation, i.e., how to distribute data in sequential programs among several banks to best exploit memory parallelism, while retaining the ability to disambiguate each data reference to a particular bank. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Taking a sequential program as input, a bank-disambiguation method produces two outputs: first, a distribution of each program object among the memory banks; and second, a bank number for every reference that can be proven to access a single, known bank for that data distribution. Finally, the thesis shows why non-disambiguated accesses are sometimes desirable. Dependences between disambiguated and non-disambiguated accesses are enforced through explicit synchronization and software serial ordering. The MIT Raw machine is an example of a software-exposed architecture. Raw exposes its ILP, memory and communication mechanisms. The Maps system has been implemented in the Raw compiler. Results on Raw using sequential codes demonstrate that using bank disambiguation in addition to ILP improves performance by a factor of 3 to 5 over using ILP alone.by Rajeev Barua.Ph.D

    POOR MAN’S TRACE CACHE: A VARIABLE DELAY SLOT ARCHITECTURE

    Get PDF
    We introduce a novel fetch architecture called Poor Man’s Trace Cache (PMTC). PMTC constructs taken-path instruction traces via instruction replication in static code and inserts them after unconditional direct and select conditional direct control transfer instructions. These traces extend to the end of the cache line. Since available space for trace insertion may vary by the position of the control transfer instruction within the line, we refer to these fetch slots as variable delay slots. This approach ensures traces are fetched along with the control transfer instruction that initiated the trace. Branch, jump and return instruction semantics as well as the fetch unit are modified to utilize traces in delay slots. PMTC yields the following benefits: 1. Average fetch bandwidth increases as the front end can fetch across taken control transfer instructions in a single cycle. 2. The dynamic number of instruction cache lines fetched by the processor is reduced as multiple non contiguous basic blocks along a given path are encountered in one fetch cycle. 3. Replication of a branch instruction along multiple paths provides path separability for branches, which positively impacts branch prediction accuracy. PMTC mechanism requires minimal modifications to the processor’s fetch unit and the trace insertion algorithm can easily be implemented within the assembler without compiler support
    • …
    corecore