45 research outputs found

    Replacement and placement policies for prefetched lines.

    by Sze Siu Ching. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 119-122). Contents:
    - Chapter 1: Introduction -- 1.1 Overlapping Computations with Memory Accesses; 1.2 Cache Line Replacement Policies; 1.3 The Rest of This Paper
    - Chapter 2: A Brief Review of the IAP Scheme -- 2.1 Embedded Hints for Next Data References; 2.2 Instruction Opcode and Addressing Mode Prefetching; 2.3 Chapter Summary
    - Chapter 3: Motivation -- 3.1 Chapter Summary
    - Chapter 4: Related Work -- 4.1 Existing Replacement Algorithms; 4.2 Placement Policies for Cache Lines; 4.3 Chapter Summary
    - Chapter 5: Replacement and Placement Policies of Prefetched Lines -- 5.1 IZ Cache Line Replacement Policy in the IAP Scheme (5.1.1 The Instant Zero Scheme); 5.2 Priority Pre-Updating and Victim Cache (5.2.1 Priority Pre-Updating; 5.2.2 Priority Pre-Updating for Cache; 5.2.3 Victim Cache for Unreferenced Prefetch Lines); 5.3 Prefetch Cache for IAP Lines; 5.4 Chapter Summary
    - Chapter 6: Performance Evaluation -- 6.1 Methodology and Metrics (6.1.1 Trace Driven Simulation; 6.1.2 Caching Models; 6.1.3 Simulation Models and Performance Metrics); 6.2 Simulation Results (6.2.1 General Results); 6.3 Simulation Results of the IZ Replacement Policy (6.3.1 Analysis of the IZ Cache Line Replacement Policy); 6.4 Simulation Results for Priority Pre-Updating with Victim Cache (6.4.1 PPUVC in Cache with IAP Scheme; 6.4.2 PPUVC in Prefetch-on-Miss Cache); 6.5 Prefetch Cache; 6.6 Chapter Summary
    - Chapter 7: Architecture Without LOAD-AND-STORE Instructions
    - Chapter 8: Conclusion
    - Appendix A: CPI Due to Cache Misses -- A.1 Varying Cache Size; A.2 Varying Cache Line Size; A.3 Varying Cache Set Associativity (each covering the Instant Zero Replacement Policy, Priority Pre-Updating with Victim Cache, and the Prefetch Cache)
    - Appendix B: Simulation Results of the IZ Replacement Policy -- B.1 Memory Delay Time Reduction (varying cache size, cache line size, and cache set associativity)
    - Appendix C: Simulation Results of Priority Pre-Updating with Victim Cache -- C.1 PPUVC in the IAP Scheme; C.2 PPUVC in Cache with Prefetch-On-Miss Only (Memory Delay Time Reduction for each)
    - Appendix D: Simulation Results of Prefetch Cache -- D.1 Memory Delay Time Reduction; D.2 Results of the Three Replacement Policies (each varying cache size, cache line size, and cache set associativity)
    - Bibliography
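
    The full thesis is not part of this record, so the sketch below is only a generic illustration of the idea named in Chapter 5.2.3: when a prefetched line is evicted before it has ever been referenced, it is parked in a small victim buffer rather than discarded, so a premature replacement can be undone cheaply. All structure names, fields, and sizes here are hypothetical and not taken from the thesis.

```c
/* Generic sketch of a victim cache for unreferenced prefetched lines.
 * Hypothetical structures; not the design evaluated in the thesis. */
#include <stdbool.h>
#include <stdint.h>

#define VICTIM_ENTRIES 8

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     prefetched;   /* line was brought in by a prefetch          */
    bool     referenced;   /* line has since been hit by a demand access */
} CacheLine;

typedef struct {
    uint32_t tag[VICTIM_ENTRIES];
    bool     valid[VICTIM_ENTRIES];
    int      next;         /* simple FIFO replacement pointer */
} VictimCache;

/* Called when the main cache evicts a line: only prefetched lines that
 * were never referenced are worth saving. */
static void on_evict(VictimCache *vc, const CacheLine *line)
{
    if (!line->valid || !line->prefetched || line->referenced)
        return;
    vc->tag[vc->next]   = line->tag;
    vc->valid[vc->next] = true;
    vc->next = (vc->next + 1) % VICTIM_ENTRIES;
}

/* Probed in parallel with the main cache on a demand miss; a hit here
 * promotes the line back into the cache instead of going to memory. */
static bool victim_lookup(VictimCache *vc, uint32_t tag)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (vc->valid[i] && vc->tag[i] == tag) {
            vc->valid[i] = false;
            return true;
        }
    }
    return false;
}
```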

    Architectural Approaches For Gallium Arsenide Exploitation In High-Speed Computer Design

    Continued advances in the capability of Gallium Arsenide (GaAs) technology have finally drawn serious interest from computer system designers. The recent demonstration of very large scale integration (VLSI) laboratory designs incorporating very fast GaAs logic gates heralds a significant role for GaAs technology in high-speed computer design. In this thesis we investigate design approaches that best exploit this promising technology in high-performance computer systems. We find significant differences between GaAs and Silicon technologies that are relevant to computer design. The advantage that GaAs enjoys over Silicon in faster transistor switching speed is countered by a lower transistor count capability for GaAs integrated circuits. In addition, inter-chip signal propagation speeds in GaAs systems do not experience the same speedup exhibited by GaAs transistors; thus, GaAs designs are penalized more severely by inter-chip communication. The relatively low density of GaAs chips and the high cost of communication between them are significant obstacles to the full exploitation of the fast transistors of GaAs technology. A fast GaAs processor may be severely underutilized unless special consideration is given to its information (instructions and data) requirements. Desirable GaAs system design approaches encourage low hardware resource requirements, and either minimize the processor's need for off-chip information, maximize the rate of off-chip information transfer, or overlap off-chip information transfer with useful computation. We show the impact that these considerations have on the design of the instruction format, arithmetic unit, memory system, and compiler for a GaAs computer system. Through a simulation study utilizing a set of widely-used benchmark programs, we investigate several candidate instruction pipelines and candidate instruction formats in a GaAs environment. We demonstrate the clear performance advantage of an instruction pipeline based upon a pipelined memory system over a typical Silicon-like pipeline. We also show the performance advantage of packed instruction formats over typical Silicon instruction formats, and present a packed format which performs better than the experimental packed Stanford MIPS format.
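
    The packed instruction formats discussed above put more than one operation into each fetched word, so every off-chip instruction fetch delivers more work to the fast GaAs processor. The bitfield layout below is a purely hypothetical illustration of that idea, not the format proposed in the thesis or the packed Stanford MIPS format.

```c
/* Hypothetical packed instruction word: two short operations share one
 * 32-bit word, halving off-chip instruction fetches per executed operation.
 * Bitfield layout is illustrative only. */
#include <stdint.h>

struct packed_pair {
    uint32_t op0_opcode : 6;   /* first operation  */
    uint32_t op0_dst    : 5;
    uint32_t op0_src    : 5;
    uint32_t op1_opcode : 6;   /* second operation */
    uint32_t op1_dst    : 5;
    uint32_t op1_src    : 5;
};                             /* 6+5+5+6+5+5 = 32 bits */
```

    With a fixed memory path width, packing of this kind trades shorter opcode and operand fields for fewer instruction fetches, which is precisely the off-chip traffic the abstract identifies as the bottleneck for a GaAs processor.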

    Unified on-chip multi-level cache management scheme using processor opcodes and addressing modes.

    by Stephen Siu-ming Wong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 164-170). Contents:
    - Chapter 1: Introduction -- 1.1 Cache Memory; 1.2 System Performance; 1.3 Cache Performance; 1.4 Cache Prefetching; 1.5 Organization of Dissertation
    - Chapter 2: Related Work -- 2.1 Memory Hierarchy; 2.2 Cache Memory Management (2.2.1 Configuration; 2.2.2 Replacement Algorithms; 2.2.3 Write Back Policies; 2.2.4 Cache Miss Types; 2.2.5 Prefetching); 2.3 Locality (2.3.1 Spatial vs. Temporal; 2.3.2 Instruction Cache vs. Data Cache); 2.4 Why Not a Large L1 Cache? (2.4.1 Critical Time Path; 2.4.2 Hardware Cost); 2.5 Trend to Have L2 Cache On Chip (2.5.1 Examples; 2.5.2 Dedicated L2 Bus); 2.6 Hardware Prefetch Algorithms (2.6.1 One Block Look-ahead; 2.6.2 Chen's RPT and Similar Algorithms); 2.7 Software Based Prefetch Algorithm (2.7.1 Prefetch Instruction); 2.8 Hybrid Prefetch Algorithm (2.8.1 Stride CAM Prefetching)
    - Chapter 3: Simulator -- 3.1 Multi-level Memory Hierarchy Simulator (3.1.1 Multi-level Memory Support; 3.1.2 Non-blocking Cache; 3.1.3 Cycle-by-cycle Simulation; 3.1.4 Cache Prefetching Support)
    - Chapter 4: Proposed Algorithms -- 4.1 SIRPA (Rationale; Architecture Model); 4.2 Line Concept (Rationale; Improvement Over "Pure" Algorithm; Architectural Model); 4.3 Combined L1-L2 Cache Management (Rationale; Feasibility); 4.4 Combine SIRPA with Default Prefetch (Rationale; Improvement Over "Pure" Algorithm; Architectural Model)
    - Chapter 5: Results -- 5.1 Benchmarks Used (5.1.1 SPEC92int and SPEC92fp); 5.2 Configurations Tested (Prefetch Algorithms; Cache Sizes; Cache Block Sizes; Cache Set Associativities; Bus Width, Speed and Other Parameters); 5.3 Validity of Results (Total Instructions and Cycles; Total References to Caches); 5.4 Overall MCPI Comparison (Cache Size Effect; Cache Block Size Effect; Set Associativity Effect; Hardware Prefetch Algorithms; Software Based Prefetch Algorithms); 5.5 L2 Cache and Main Memory MCPI Comparison (Cache Size Effect; Cache Block Size Effect; Set Associativity Effect)
    - Chapter 6: Conclusion
    - Chapter 7: Future Directions -- 7.1 Prefetch Buffer; 7.2 Dissimilar L1-L2 Management; 7.3 Combined LRU/MRU Replacement Policy; 7.4 N Loops Look-ahead
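
    The hardware prefetch algorithms surveyed in Chapter 2.6 include one block look-ahead (OBL). As a point of reference, the prefetch-on-miss variant of OBL is sketched below in a generic, textbook form; the helper functions are hypothetical and this is not the SIRPA scheme the thesis proposes.

```c
/* Textbook prefetch-on-miss one block look-ahead (OBL): on a demand miss
 * to block b, fetch b and also prefetch b+1. Helpers are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 32u                    /* assumed bytes per cache block */

bool cache_lookup(uint32_t block);        /* true on hit                   */
void cache_fill(uint32_t block);          /* fetch block from next level   */

void memory_access(uint32_t addr)
{
    uint32_t block = addr / BLOCK_SIZE;

    if (!cache_lookup(block)) {
        cache_fill(block);                /* demand fetch                  */
        cache_fill(block + 1);            /* look-ahead prefetch           */
    }
}
```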

    Data prefetching using hardware register value predictable table.

    by Chin-Ming Cheung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 95-97). Contents:
    - Abstract; Acknowledgement
    - Chapter 1: Introduction -- 1.1 Overview; 1.2 Objective; 1.3 Organization of the Dissertation
    - Chapter 2: Related Works -- 2.1 Previous Cache Works; 2.2 Data Prefetching Techniques (2.2.1 Hardware vs. Software Assisted; 2.2.2 Non-selective vs. Highly Selective; 2.2.3 Summary on Previous Data Prefetching Schemes)
    - Chapter 3: Program Data Mapping -- 3.1 Regular and Irregular Data Access; 3.2 Propagation of Data Access Regularity (3.2.1 Data Access Regularity in High Level Program; 3.2.2 Data Access Regularity in Machine Code; 3.2.3 Data Access Regularity in Memory Address Sequence; 3.2.4 Implication)
    - Chapter 4: Register Value Prediction Table (RVPT) -- 4.1 Predictability of Register Values; 4.2 Register Value Prediction Table; 4.3 Control Scheme of RVPT (4.3.1 Details of RVPT Mechanism; 4.3.2 Explanation of the Register Prediction Mechanism); 4.4 Examples of RVPT (4.4.1 Linear Array Example; 4.4.2 Linked List Example)
    - Chapter 5: Program Register Dependency -- 5.1 Register Dependency; 5.2 Generalized Concept of Register (5.2.1 Cyclic Dependent Register (CDR); 5.2.2 Acyclic Dependent Register (ADR)); 5.3 Program Register Overview
    - Chapter 6: Generalized RVPT Model -- 6.1 Level N RVPT Model (6.1.1 Identification of Level N CDR; 6.1.2 Recording CDR Instructions of Level N CDR; 6.1.3 Prediction of Level N CDR); 6.2 Level 2 Register Value Prediction Table (6.2.1 Level 2 RVPT Structure; 6.2.2 Identification of Level 2 CDR; 6.2.3 Control Scheme of Level 2 RVPT; 6.2.4 Example of Index Array)
    - Chapter 7: Performance Evaluation -- 7.1 Evaluation Methodology (7.1.1 Trace-Driven Simulation; 7.1.2 Architectural Method; 7.1.3 Benchmarks and Metrics); 7.2 General Results (7.2.1 Constant Stride or Regular Data Access Applications; 7.2.2 Non-constant Stride or Irregular Data Access Applications); 7.3 Effect of Design Variations (7.3.1 Effect of Cache Size; 7.3.2 Effect of Block Size; 7.3.3 Effect of Set Associativity); 7.4 Summary
    - Chapter 8: Conclusion and Future Research -- 8.1 Conclusion; 8.2 Future Research
    - Bibliography
    - Appendices A-F: MCPI and MCPI reduction percentage versus cache size, block size, and set associativity
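
    Only the table of contents is reproduced in this record, so the snippet below sketches the general idea behind a register value prediction table rather than the specific RVPT mechanism of Chapter 4: per load instruction, remember the last value of the address register, detect a constant stride, and prefetch the predicted next address. Names and fields are hypothetical.

```c
/* Generic stride-based register value predictor; hypothetical design,
 * not the RVPT described in the thesis. */
#include <stdbool.h>
#include <stdint.h>

#define RVPT_ENTRIES 64

typedef struct {
    uint32_t pc;          /* load instruction owning this entry          */
    uint32_t last_value;  /* address register value on the previous run  */
    int32_t  stride;      /* difference between the last two values      */
    bool     valid;
} RvptEntry;

static RvptEntry rvpt[RVPT_ENTRIES];

void prefetch(uint32_t addr);             /* assumed memory-system hook */

/* Called whenever the load at 'pc' computes address register value 'value'. */
void rvpt_update(uint32_t pc, uint32_t value)
{
    RvptEntry *e = &rvpt[(pc >> 2) % RVPT_ENTRIES];

    if (e->valid && e->pc == pc) {
        int32_t new_stride = (int32_t)(value - e->last_value);
        if (new_stride != 0 && new_stride == e->stride)
            prefetch(value + (uint32_t)new_stride);   /* stride confirmed */
        e->stride = new_stride;
    } else {
        e->pc     = pc;
        e->stride = 0;
        e->valid  = true;
    }
    e->last_value = value;
}
```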

    ADAM: a decentralized parallel computer architecture featuring fast thread and data migration and a uniform hardware abstraction

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 247-256). The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies. by Andrew "bunnie" Huang. Ph.D.
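
    Taking the figures quoted in the abstract at face value, a quick back-of-the-envelope check shows when migration pays off. Only the 66-cycle null-thread cost and the 4-to-5x typical-thread factor come from the abstract; the remote-access penalty below is an assumed number used purely for illustration.

```c
/* Break-even estimate: migrating a typical thread costs roughly 4-5 times
 * the 66-cycle null-thread migration; the remote-access penalty is assumed. */
#include <stdio.h>

int main(void)
{
    const int null_thread_cycles    = 66;                      /* from abstract */
    const int typical_thread_cycles = 5 * null_thread_cycles;  /* ~330 cycles   */
    const int remote_access_penalty = 40;                      /* assumed       */

    /* Migration wins once the thread would otherwise make more remote
     * accesses than the migration itself costs, measured in penalty units. */
    printf("break-even at roughly %d remote accesses\n",
           typical_thread_cycles / remote_access_penalty + 1);
    return 0;
}
```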

    ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction

    The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.

    AMC: Advanced Multi-accelerator Controller

    Rapid advances in FPGA technology, the use of diverse architectural features, and the introduction of High Level Synthesis (HLS) tools have increased the data-level parallelism available on a chip. A generic FPGA-based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules the accelerators. In practice, such HLS multi-accelerator systems fall short of peak performance because of memory bandwidth issues. Such a system therefore needs a memory manager and a scheduler that improve performance by managing and scheduling the accelerators' memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and an efficient scheduler into the HLS-based multi-accelerator environment, called the Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory-intensive accelerators and High Performance Computing (HPC) applications, and is implemented and tested on a Xilinx Virtex-5 ML505 FPGA evaluation board. Its performance is compared against microprocessor-based systems integrated with an operating system. Results show that the AMC-based HLS multi-accelerator system achieves speedups of 10.4x and 7x over the MicroBlaze-based and Intel Core-based HLS multi-accelerator systems, respectively. Peer reviewed. Postprint (author's final draft).
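
    The abstract does not spell out the AMC scheduling policy, so the fragment below is only a hedged illustration of the kind of pattern-aware scheduling it alludes to: among accelerators with pending transfers, prefer the one whose next burst continues sequentially from the last address issued, and otherwise favour longer bursts. Structures and policy are hypothetical, not the AMC design.

```c
/* Illustrative pattern-aware choice of the next accelerator transfer;
 * hypothetical structures and policy, not the AMC implementation. */
#include <stdint.h>

#define MAX_ACCEL 4

typedef struct {
    uint32_t next_addr;   /* start address of the next pending burst */
    uint32_t burst_len;   /* bytes in that burst                     */
    int      pending;     /* non-zero if a transfer is queued        */
} AccelRequest;

/* Returns the index of the accelerator to serve next, or -1 if all idle. */
int schedule_next(const AccelRequest req[MAX_ACCEL], uint32_t last_addr_issued)
{
    int best = -1;

    for (int i = 0; i < MAX_ACCEL; i++) {
        if (!req[i].pending)
            continue;
        if (req[i].next_addr == last_addr_issued)
            return i;                      /* sequential continuation wins   */
        if (best < 0 || req[i].burst_len > req[best].burst_len)
            best = i;                      /* otherwise prefer longer bursts */
    }
    return best;
}
```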