14 research outputs found

    A Flexible Multi-port Caching Scheme for Reconfigurable Platforms

    Reducing the complexity of the register file in dynamic superscalar processors

    Journal Article
    Dynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window, as most in-flight instructions require a new physical register at dispatch. A large multi-ported register file helps improve instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals.
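    The banked organization described in this abstract can be sketched in miniature: register reads that map to the same bank in one cycle must arbitrate for that bank's limited ports, and the losers stall. Below is a minimal Python sketch, assuming a simple modulo bank mapping and one read port per bank; both are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of one read cycle in a banked register file.
# NUM_BANKS and the modulo mapping are assumptions for illustration.

NUM_BANKS = 4
READ_PORTS_PER_BANK = 1

def bank_of(phys_reg: int) -> int:
    """Map a physical register to a bank (assumed modulo interleave)."""
    return phys_reg % NUM_BANKS

def schedule_reads(read_regs):
    """Greedily issue register reads this cycle; conflicting reads stall.

    Returns (issued, stalled) lists of physical register numbers.
    """
    ports_used = {b: 0 for b in range(NUM_BANKS)}
    issued, stalled = [], []
    for r in read_regs:
        b = bank_of(r)
        if ports_used[b] < READ_PORTS_PER_BANK:
            ports_used[b] += 1
            issued.append(r)
        else:
            stalled.append(r)  # bank port conflict: retry next cycle
    return issued, stalled

# Registers 3 and 7 both map to bank 3; 4 and 8 both map to bank 0,
# so one read from each pair stalls this cycle.
issued, stalled = schedule_reads([3, 7, 4, 8])
```

    The point of the paper's banking is exactly this trade: fewer ports per bank shrink the structure, at the cost of occasional conflict stalls that the evaluation shows are rare enough to preserve IPC.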

    Dynamic data memory partitioning for access region caches

    For wide-issue processors, the data cache needs to be heavily multi-ported with extremely wide data paths. A recent multi-ported cache design proposal divides memory streams into multiple independent sub-streams, with the help of a prediction mechanism, before they enter the reservation stations. Partitioned memory-reference instructions are then fed into separate memory pipelines, each of which is connected to a small data cache called an access region cache (ARC). The selection function that maps memory references to each ARC can affect data memory bandwidth, as conflicts and load balance at each ARC may differ. In this thesis, we study various static and dynamic memory partitioning methods to see the effects of distributing memory references among the ARCs by exposing the memory traffic of those designs. Six different approaches to distributing memory references, including two randomization methods and two dynamic methods, are considered. The potential effects on memory performance with ARCs are measured and compared with an existing multi-porting solution as well as an ideal multi-ported data cache. This study concludes that scattering access conflicts dynamically, i.e., redirecting conflicting references to different ARCs at each cycle, can increase memory bandwidth. However, increasing data bandwidth alone does not always result in performance improvement. Keeping the cache miss rate low is as important as sufficient memory bandwidth for achieving higher performance in wide-issue processors.
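    The contrast the thesis studies, static selection versus dynamic conflict scattering, can be sketched as follows. The address-interleaved mapping, the block size, and the "spill to least-loaded ARC" rule are assumptions made for illustration, not the thesis's actual selection functions.

```python
# Sketch: static address-interleaved ARC selection vs. a dynamic policy
# that redirects conflicting references each cycle. NUM_ARCS and BLOCK
# are illustrative assumptions.

NUM_ARCS = 2
BLOCK = 32  # assumed cache block size in bytes

def static_select(addr: int) -> int:
    """Static: choose an ARC by address interleaving at block granularity."""
    return (addr // BLOCK) % NUM_ARCS

def dynamic_select(addrs):
    """Dynamic: scatter one cycle's references, redirecting on conflict.

    Each ARC accepts one reference per cycle; a reference whose preferred
    ARC is busy is redirected to the least-loaded ARC instead of stalling.
    Returns the ARC assigned to each address, in order.
    """
    load = [0] * NUM_ARCS
    assignment = []
    for a in addrs:
        arc = static_select(a)
        if load[arc] > 0:                # conflict at the preferred ARC
            arc = load.index(min(load))  # redirect to least-loaded ARC
        load[arc] += 1
        assignment.append(arc)
    return assignment

# Addresses 0 and 64 both prefer ARC 0 statically; the dynamic policy
# redirects one of them, keeping both ARCs busy in the same cycle.
assignment = dynamic_select([0, 64, 32])
```

    Note that redirection trades bandwidth for locality: a redirected reference may miss in its new ARC, which is why the abstract stresses that bandwidth gains alone do not guarantee speedup.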

    Dead-block prediction & dead-block correlating prefetchers

    Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes Dead-Block Predictors (DBPs), trace-based predictors that accurately identify “when” an L1 data cache block becomes evictable, or “dead”. Predicting a dead block significantly enhances prefetching lookahead and opportunity, and enables placing data directly into L1, obviating the need for auxiliary prefetch buffers. This paper also proposes Dead-Block Correlating Prefetchers (DBCPs), which use address correlation to predict “which” block to prefetch next when a block becomes evictable. A DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications. We use cycle-accurate simulation of an out-of-order superscalar processor and memory-intensive benchmarks to show that: (1) dead-block prediction enhances prefetching lookahead by at least an order of magnitude compared to previous techniques, (2) a DBP can predict dead blocks with an average coverage of 90%, mispredicting only 4% of the time, (3) a DBCP offers an address prediction coverage of 86%, mispredicting only 3% of the time, and (4) DBCPs improve performance by 62% on average and 282% at best on the benchmarks we studied.
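    The trace-based idea behind a DBP can be illustrated with a toy model: for each live block, accumulate the trace of PCs that touch it; on eviction, remember that trace as one that ends a block's lifetime; on later accesses, predict "dead" the moment a block's trace matches a remembered one. Real DBPs compress the trace into a fixed-width signature; using the raw PC tuple here is a simplifying assumption, not the paper's encoding.

```python
class DeadBlockPredictor:
    """Toy trace-based dead-block predictor (illustrative only)."""

    def __init__(self):
        self.trace = {}           # block address -> tuple of PCs that touched it
        self.dead_traces = set()  # traces after which a block was evicted

    def access(self, block, pc):
        """Record an access; return True if the block is now predicted dead."""
        t = self.trace.get(block, ()) + (pc,)
        self.trace[block] = t
        return t in self.dead_traces

    def evict(self, block):
        """On eviction, learn that this block's trace ended its lifetime."""
        t = self.trace.pop(block, None)
        if t is not None:
            self.dead_traces.add(t)

# Train on one lifetime of block 0x100, touched by PCs 1 then 2.
dbp = DeadBlockPredictor()
dbp.access(0x100, 1)
dbp.access(0x100, 2)
dbp.evict(0x100)
```

    After training, replaying the same access sequence flags the block dead immediately after its last touch, long before the eviction itself, which is the source of the lookahead the paper exploits.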

    Approach to applying the contour model to the design of a hypothetical multiple-register-window architecture for the block-structured process

    The concepts described in this thesis are towards the implementation of the basic functions of a pipelined, load/store, multiple-register-window, scalar-oriented uniprocessor architecture. During the formation phase of these concepts, I was glad to have the opportunity to investigate the interrelation of computer architectures, data structures, and systems programming, which are the fundamentals underlying virtually every software design. I also took pleasure in learning the AWK and C++ programming languages (only the elementary aspects of the latter, however) for the simulation conducted in this thesis, and the UNIX document formatting/typesetting tools for the preparation of the text and figures presented in this thesis on the UNIX-based PerkinElmer 3230 computer system of the Computer Science Department.

    Task Activity Vectors: A Novel Metric for Temperature-Aware and Energy-Efficient Scheduling

    This thesis introduces the abstraction of the task activity vector to characterize applications by the processor resources they utilize. Based on activity vectors, the thesis introduces scheduling policies for improving the temperature distribution on the processor chip and for increasing energy efficiency by reducing contention for the shared resources of multicore and multithreaded processors.
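    One way to picture activity-vector scheduling: if each task carries a vector of per-resource utilizations, tasks that overlap little can share hardware with less contention. The sketch below pairs tasks by minimizing the dot product of their vectors; the dot-product contention proxy, the greedy pairing, and the example tasks are all assumptions for illustration, not the thesis's actual policies.

```python
# Illustrative vector-based co-scheduling: greedily pair tasks whose
# activity vectors overlap least. All names and numbers are assumed.

from itertools import combinations

def contention(u, v):
    """Proxy for shared-resource contention: overlap of activity vectors."""
    return sum(a * b for a, b in zip(u, v))

def pair_tasks(vectors):
    """Greedily pair tasks so co-scheduled pairs have minimal overlap."""
    remaining = dict(vectors)
    pairs = []
    while len(remaining) >= 2:
        a, b = min(combinations(remaining, 2),
                   key=lambda p: contention(remaining[p[0]], remaining[p[1]]))
        pairs.append((a, b))
        del remaining[a], remaining[b]
    return pairs

# Assumed activity vectors over (integer ALU, FPU, memory) utilization.
tasks = {"int_heavy": (0.9, 0.1, 0.2),
         "fp_heavy":  (0.1, 0.9, 0.2),
         "mem_heavy": (0.2, 0.1, 0.9),
         "mixed":     (0.5, 0.5, 0.5)}
```

    On this example the integer-bound and FP-bound tasks end up paired, since they stress disjoint units; the same vectors could instead drive thermally aware placement by spreading tasks that heat the same chip region.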

    High-Bandwidth Data Memory Systems for Superscalar Processors
