266 research outputs found

    A general framework to realize an abstract machine as an ILP processor with application to Java

    Ph.D., Doctor of Philosophy

    Limits of a decoupled out-of-order superscalar architecture


    The effects of aggressive out-of-order mechanisms on the memory sub-system

    Contrary to existing work that demonstrates significant performance improvements with larger reorder buffers, the work presented in this dissertation shows that larger instruction windows do not necessarily provide significant improvements in performance. Using detailed models of the DRAM system and the memory subsystem, we show that increasing out-of-order aggressiveness by growing the reorder buffer beyond 128 entries no longer buys any improvement in processor performance; in fact, it can actually degrade performance. Additionally, this dissertation demonstrates a non-intuitive problem associated with the out-of-order execution of memory instructions: reordering memory instructions can degrade the performance of the memory subsystem itself. Specifically, we show that increasing out-of-order aggressiveness in terms of reorder buffer size increases the frequency of replay traps and data cache misses. The presentation of this problem is itself significant: the very mechanisms commonly used to improve performance are sources of performance degradation in the memory subsystem. We observe that while the negative effects of out-of-order execution occurred only a small fraction of the time with small reorder buffers, eliminating other sources of stalls by increasing out-of-order capability amplifies these unexpected side effects in the memory subsystem until they represent significant overhead. This shows that one cannot overlook rarely occurring events in the memory subsystem. To gain insight into the source of the problem, we measure the degree to which memory-system performance relies on out-of-order execution. Borrowing the network-communication concept of windowing, we vary the load/store scheduling window independently of the ALU scheduling window. This study reveals that memory instructions issued out of order are the primary reason for the increased frequency of replay traps, and that the out-of-order issue of memory instructions is responsible for both the constructive and the destructive references to the data cache. Incorporating detailed memory-subsystem models and a realistic DRAM model into existing simulators, and filtering the destructive references out of the total cache references, can allow aggressive out-of-order cores to reap the true benefits of out-of-order execution.
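    A minimal sketch of the windowing idea described in the abstract above, under assumed semantics (the `Uop` and `select_issue` names are hypothetical, not taken from the dissertation's simulator): memory operations may issue out of order only within a bounded load/store window that is sized independently of the ALU scheduling window.

```python
# Illustrative sketch only: a toy issue stage where the load/store
# scheduling window is sized independently of the ALU window, in the
# spirit of the network-style "windowing" study described above.
from collections import deque
from dataclasses import dataclass

@dataclass
class Uop:
    seq: int        # program order
    is_mem: bool    # load/store vs. ALU operation
    ready: bool     # operands available

def select_issue(rob, alu_window, mem_window):
    """Pick ready uops to issue this cycle.

    ALU ops may issue from anywhere in the first `alu_window` ROB
    entries; memory ops only from the first `mem_window` entries, so
    shrinking `mem_window` toward 1 forces loads/stores back into
    program order while ALU ops remain fully out of order.
    """
    issued = []
    for slot, uop in enumerate(rob):
        window = mem_window if uop.is_mem else alu_window
        if slot < window and uop.ready:
            issued.append(uop)
    return issued

# A ready load at ROB slot 3 issues when mem_window=8, but is held
# back (kept in program order) when mem_window=1.
rob = deque([Uop(0, False, False), Uop(1, False, True),
             Uop(2, True, False), Uop(3, True, True)])
print([u.seq for u in select_issue(rob, alu_window=8, mem_window=8)])  # [1, 3]
print([u.seq for u in select_issue(rob, alu_window=8, mem_window=1)])  # [1]
```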

    Efficient Scaling of Out-of-Order Processor Resources

    With the dawn of the multi-core era, processor microarchitects have exploited Moore's-law transistor scaling by increasing core density on a chip and the number of thread contexts within a core, rather than by improving single-threaded performance. However, single-thread performance and efficiency are still very relevant in the power-constrained multi-core era, as increasing core counts do not yield corresponding performance improvements under real thermal and thread-level constraints. This dissertation provides a detailed study of register reference-count structures and their application to both conventional and non-conventional, latency-tolerant, out-of-order processors. Prior work has incorporated reference counting, but without a detailed implementation or energy model. This dissertation presents a working implementation of reference-count structures and shows that the overheads are low and can be recouped by the techniques they enable in high-performance out-of-order processors. A study of register-allocation algorithms exploits register-file occupancy to reduce power consumption by dynamically resizing the register file, which is especially important for wider multi-threaded processors that require larger register files. Latency tolerance has been introduced as a technique to improve single-threaded performance by removing cache-miss-dependent instructions from the execution pipeline until the miss returns. This dissertation introduces a microarchitecture with a predictive approach to identifying long-latency loads, reducing the energy cost and overhead of scaling the instruction window inherent in latency-tolerant microarchitectures. The key features include a front-end predictive slice-out mechanism and an in-order queue structure, along with mechanisms to reduce the energy cost and register-file usage of executing instructions. Cycle-level simulation shows improved performance and reduced energy-delay for memory-bound workloads. Both techniques scale processor resources, addressing register-file inefficiency and the allocation of processor resources to instructions during low-ILP regions.
    Ph.D., Computer Engineering -- Drexel University, 201
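    A minimal sketch of a reference-counted register file as described in the abstract above, under assumed semantics (the `RefCountedRegFile` class and its policy are illustrative, not the dissertation's implementation): the mapping and each in-flight reader hold a count, a physical register is recycled only when its count returns to zero, and the occupancy statistic is what a dynamic resizer would monitor.

```python
# Illustrative sketch only: per-register reference counts on a
# physical register file, as a way to track occupancy and recycle
# registers as soon as they become dead.
class RefCountedRegFile:
    def __init__(self, num_regs):
        self.counts = [0] * num_regs
        self.free_list = list(range(num_regs))

    def allocate(self):
        """Rename stage: map a new destination to a free physical register."""
        preg = self.free_list.pop()
        self.counts[preg] = 1              # held by the register map
        return preg

    def add_reader(self, preg):
        """A consumer uop captures this register as a source operand."""
        self.counts[preg] += 1

    def release(self, preg):
        """A reader completes, or the mapping is overwritten at rename."""
        self.counts[preg] -= 1
        if self.counts[preg] == 0:
            self.free_list.append(preg)    # safe to recycle immediately

    def occupancy(self):
        """Live registers: the statistic a dynamic resizer would watch."""
        return sum(1 for c in self.counts if c > 0)

rf = RefCountedRegFile(8)
p = rf.allocate()
rf.add_reader(p)   # one in-flight consumer
rf.release(p)      # consumer finishes
rf.release(p)      # mapping overwritten; register returns to the free list
assert rf.occupancy() == 0
```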

    Hybrid Designs for Caches and Cores.

    Processor power constraints have come to the forefront over the last decade, heralded by the stagnation of clock-frequency scaling. High-performance core and cache designs often use power-hungry techniques to increase parallelism, while the most energy-efficient designs opt for serial execution to avoid unnecessary overheads. Both of these extremes are one-size-fits-all approaches; a judicious mix of parallel and serial execution has the potential to achieve the best of both high-performing and energy-efficient designs. This dissertation examines such hybrid designs for cores and caches. First, we introduce a novel hybrid out-of-order/in-order core microarchitecture. Instructions steered toward in-order execution skip register allocation, reordering, and dynamic scheduling, yet can interleave on an instruction-by-instruction basis with instructions that continue to benefit from these conventional out-of-order mechanisms. Second, this dissertation revisits way-prediction, a hybrid technique introduced for L1 caches, in the context of last-level caches, which are larger, have higher associativity, and experience less locality.
    Ph.D., Computer Science and Engineering -- University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/113484/1/sleimanf_1.pd
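    A minimal sketch of way-prediction as applied to a set-associative cache, with an assumed last-hit-way policy (the `WayPredictedCache` class is illustrative and not taken from the dissertation): only the predicted way is probed first, and the remaining ways are checked only on a way mispredict, trading occasional extra latency for the dynamic energy of probing all ways in parallel.

```python
# Illustrative sketch only: way-prediction on a set-associative cache.
# Probing one predicted way first saves the dynamic energy of reading
# all ways in parallel; a mispredict costs extra serial probes.
class WayPredictedCache:
    def __init__(self, num_sets, num_ways):
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.predictor = [0] * num_sets    # last-hit way per set (assumed policy)

    def lookup(self, set_idx, tag):
        """Return (hit, ways_probed); ways_probed tracks the energy cost."""
        pred = self.predictor[set_idx]
        if self.tags[set_idx][pred] == tag:
            return True, 1                 # fast path: single way probed
        for way, t in enumerate(self.tags[set_idx]):
            if way != pred and t == tag:
                self.predictor[set_idx] = way   # retrain on the hitting way
                return True, len(self.tags[set_idx])
        return False, len(self.tags[set_idx])

c = WayPredictedCache(num_sets=4, num_ways=8)
c.tags[0][5] = 0xBEEF
print(c.lookup(0, 0xBEEF))   # (True, 8): mispredicted, all ways probed
print(c.lookup(0, 0xBEEF))   # (True, 1): predictor now points at way 5
```

    At the last level, higher associativity raises the cost of each mispredict while lower locality makes the prediction itself harder, which is the trade-off the abstract refers to.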

    On the classification and evaluation of prefetching schemes

    Abstract available: p. [2]