45 research outputs found
Replacement and placement policies for prefetched lines.
by Sze Siu Ching. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 119-122).

Contents:
1 Introduction
  1.1 Overlapping Computations with Memory Accesses
  1.2 Cache Line Replacement Policies
  1.3 The Rest of This Paper
2 A Brief Review of the IAP Scheme
  2.1 Embedded Hints for Next Data References
  2.2 Instruction Opcode and Addressing Mode Prefetching
  2.3 Chapter Summary
3 Motivation
  3.1 Chapter Summary
4 Related Work
  4.1 Existing Replacement Algorithms
  4.2 Placement Policies for Cache Lines
  4.3 Chapter Summary
5 Replacement and Placement Policies of Prefetched Lines
  5.1 IZ Cache Line Replacement Policy in the IAP Scheme (5.1.1 The Instant Zero Scheme)
  5.2 Priority Pre-Updating and Victim Cache (5.2.1 Priority Pre-Updating; 5.2.2 Priority Pre-Updating for Cache; 5.2.3 Victim Cache for Unreferenced Prefetch Lines)
  5.3 Prefetch Cache for IAP Lines
  5.4 Chapter Summary
6 Performance Evaluation
  6.1 Methodology and Metrics (6.1.1 Trace-Driven Simulation; 6.1.2 Caching Models; 6.1.3 Simulation Models and Performance Metrics)
  6.2 Simulation Results (6.2.1 General Results)
  6.3 Simulation Results of the IZ Replacement Policy (6.3.1 Analysis of the IZ Cache Line Replacement Policy)
  6.4 Simulation Results for Priority Pre-Updating with Victim Cache (6.4.1 PPUVC in Cache with IAP Scheme; 6.4.2 PPUVC in Prefetch-on-Miss Cache)
  6.5 Prefetch Cache
  6.6 Chapter Summary
7 Architecture Without LOAD-AND-STORE Instructions
8 Conclusion
Appendix A: CPI Due to Cache Misses
  A.1 Varying Cache Size (A.1.1 Instant Zero Replacement Policy; A.1.2 Priority Pre-Updating with Victim Cache; A.1.3 Prefetch Cache)
  A.2 Varying Cache Line Size (A.2.1 Instant Zero Replacement Policy; A.2.2 Priority Pre-Updating with Victim Cache; A.2.3 Prefetch Cache)
  A.3 Varying Cache Set Associativity (A.3.1 Instant Zero Replacement Policy; A.3.2 Priority Pre-Updating with Victim Cache; A.3.3 Prefetch Cache)
Appendix B: Simulation Results of the IZ Replacement Policy
  B.1 Memory Delay Time Reduction (B.1.1 Varying Cache Size; B.1.2 Varying Cache Line Size; B.1.3 Varying Cache Set Associativity)
Appendix C: Simulation Results of Priority Pre-Updating with Victim Cache
  C.1 PPUVC in IAP Scheme (C.1.1 Memory Delay Time Reduction)
  C.2 PPUVC in Cache with Prefetch-on-Miss Only (C.2.1 Memory Delay Time Reduction)
Appendix D: Simulation Results of Prefetch Cache
  D.1 Memory Delay Time Reduction (D.1.1 Varying Cache Size; D.1.2 Varying Cache Line Size; D.1.3 Varying Cache Set Associativity)
  D.2 Results of the Three Replacement Policies (D.2.1 Varying Cache Size; D.2.2 Varying Cache Line Size; D.2.3 Varying Cache Set Associativity)
Bibliography
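The table of contents above mentions a victim cache for unreferenced prefetch lines (Section 5.2.3). The thesis's exact mechanism isn't reproduced here, but the general idea behind such a structure can be sketched: a line that was prefetched but evicted before ever being referenced gets a second chance in a small fully-associative buffer. The class names and the direct-mapped main-cache organisation below are illustrative assumptions, not the thesis's design.

```python
from collections import OrderedDict

class VictimCache:
    """Small fully-associative buffer holding prefetched lines that were
    evicted before being referenced."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = OrderedDict()  # (index, tag) -> data, in LRU order

    def insert(self, key, data):
        if key in self.lines:
            self.lines.move_to_end(key)
        self.lines[key] = data
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the LRU victim entry

    def lookup(self, key):
        return self.lines.pop(key, None)    # hit removes the entry

class Cache:
    """Direct-mapped cache; unreferenced prefetched lines go to the victim
    buffer on eviction instead of being discarded."""
    def __init__(self, num_sets=8):
        self.num_sets = num_sets
        self.sets = {}          # index -> (tag, data, was_prefetched, referenced)
        self.victim = VictimCache()

    def fill(self, addr, data, prefetched=False):
        idx, tag = addr % self.num_sets, addr // self.num_sets
        old = self.sets.get(idx)
        if old is not None:
            old_tag, old_data, was_pf, referenced = old
            if was_pf and not referenced:
                # prefetched but never used: give it a second chance
                self.victim.insert((idx, old_tag), old_data)
        self.sets[idx] = (tag, data, prefetched, False)

    def read(self, addr):
        idx, tag = addr % self.num_sets, addr // self.num_sets
        entry = self.sets.get(idx)
        if entry is not None and entry[0] == tag:
            self.sets[idx] = (tag, entry[1], entry[2], True)  # mark referenced
            return entry[1], "hit"
        data = self.victim.lookup((idx, tag))
        if data is not None:
            self.fill(addr, data)       # promote back into the main cache
            return data, "victim-hit"
        return None, "miss"
```

On a read, the main cache is probed first; on a miss, the victim buffer is probed, and a hit there promotes the line back into the main cache rather than paying a full memory access.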
Architectural Approaches For Gallium Arsenide Exploitation In High-Speed Computer Design
Continued advances in the capability of Gallium Arsenide (GaAs) technology have finally drawn serious interest from computer system designers. The recent demonstration of very large scale integration (VLSI) laboratory designs incorporating very fast GaAs logic gates heralds a significant role for GaAs technology in high-speed computer design. In this thesis we investigate design approaches that best exploit this promising technology in high-performance computer systems. We find significant differences between GaAs and Silicon technologies which are of relevance for computer design. The advantage that GaAs enjoys over Silicon in faster transistor switching speed is countered by a lower transistor count capability for GaAs integrated circuits. In addition, inter-chip signal propagation speeds in GaAs systems do not experience the same speedup exhibited by GaAs transistors; thus, GaAs designs are penalized more severely by inter-chip communication. The relatively low density of GaAs chips and the high cost of communication between them are significant obstacles to the full exploitation of the fast transistors of GaAs technology. A fast GaAs processor may be excessively underutilized unless special consideration is given to its information (instructions and data) requirements. Desirable GaAs system design approaches encourage low hardware resource requirements, and either minimize the processor's need for off-chip information, maximize the rate of off-chip information transfer, or overlap off-chip information transfer with useful computation. We show the impact that these considerations have on the design of the instruction format, arithmetic unit, memory system, and compiler for a GaAs computer system. Through a simulation study utilizing a set of widely-used benchmark programs, we investigate several candidate instruction pipelines and candidate instruction formats in a GaAs environment.
We demonstrate the clear performance advantage of an instruction pipeline based upon a pipelined memory system over a typical Silicon-like pipeline. We also show the performance advantage of packed instruction formats over typical Silicon instruction formats, and present a packed format which performs better than the experimental packed Stanford MIPS format.
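The packed-format idea evaluated above can be illustrated generically: fitting two short operations into each memory word so that every off-chip instruction fetch delivers more work to a fast processor. The 16-bit sub-instruction width below is a hypothetical choice for illustration, not the thesis's actual encoding.

```python
def pack(op1, op2):
    """Pack two hypothetical 16-bit sub-instructions into one 32-bit word,
    so each off-chip instruction fetch delivers two operations."""
    assert 0 <= op1 < (1 << 16) and 0 <= op2 < (1 << 16)
    return (op1 << 16) | op2

def unpack(word):
    """Split a 32-bit fetched word back into its two 16-bit operations."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF
```

The trade-off the thesis studies is that narrower sub-instructions halve the fetch bandwidth required per operation, at the cost of a more restricted encoding.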
Unified on-chip multi-level cache management scheme using processor opcodes and addressing modes.
by Stephen Siu-ming Wong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 164-170).

Contents:
1 Introduction
  1.1 Cache Memory
  1.2 System Performance
  1.3 Cache Performance
  1.4 Cache Prefetching
  1.5 Organization of Dissertation
2 Related Work
  2.1 Memory Hierarchy
  2.2 Cache Memory Management (2.2.1 Configuration; 2.2.2 Replacement Algorithms; 2.2.3 Write-Back Policies; 2.2.4 Cache Miss Types; 2.2.5 Prefetching)
  2.3 Locality (2.3.1 Spatial vs. Temporal; 2.3.2 Instruction Cache vs. Data Cache)
  2.4 Why Not a Large L1 Cache? (2.4.1 Critical Time Path; 2.4.2 Hardware Cost)
  2.5 Trend to Have L2 Cache On Chip (2.5.1 Examples; 2.5.2 Dedicated L2 Bus)
  2.6 Hardware Prefetch Algorithms (2.6.1 One Block Look-ahead; 2.6.2 Chen's RPT and Similar Algorithms)
  2.7 Software-Based Prefetch Algorithm (2.7.1 Prefetch Instruction)
  2.8 Hybrid Prefetch Algorithm (2.8.1 Stride CAM Prefetching)
3 Simulator
  3.1 Multi-level Memory Hierarchy Simulator (3.1.1 Multi-level Memory Support; 3.1.2 Non-blocking Cache; 3.1.3 Cycle-by-cycle Simulation; 3.1.4 Cache Prefetching Support)
4 Proposed Algorithms
  4.1 SIRPA (4.1.1 Rationale; 4.1.2 Architecture Model)
  4.2 Line Concept (4.2.1 Rationale; 4.2.2 Improvement Over "Pure" Algorithm; 4.2.3 Architectural Model)
  4.3 Combined L1-L2 Cache Management (4.3.1 Rationale; 4.3.2 Feasibility)
  4.4 Combine SIRPA with Default Prefetch (4.4.1 Rationale; 4.4.2 Improvement Over "Pure" Algorithm; 4.4.3 Architectural Model)
5 Results
  5.1 Benchmarks Used (5.1.1 SPEC92int and SPEC92fp)
  5.2 Configurations Tested (5.2.1 Prefetch Algorithms; 5.2.2 Cache Sizes; 5.2.3 Cache Block Sizes; 5.2.4 Cache Set Associativities; 5.2.5 Bus Width, Speed and Other Parameters)
  5.3 Validity of Results (5.3.1 Total Instructions and Cycles; 5.3.2 Total References to Caches)
  5.4 Overall MCPI Comparison (5.4.1 Cache Size Effect; 5.4.2 Cache Block Size Effect; 5.4.3 Set Associativity Effect; 5.4.4 Hardware Prefetch Algorithms; 5.4.5 Software-Based Prefetch Algorithms)
  5.5 L2 Cache & Main Memory MCPI Comparison (5.5.1 Cache Size Effect; 5.5.2 Cache Block Size Effect; 5.5.3 Set Associativity Effect)
6 Conclusion
7 Future Directions (7.1 Prefetch Buffer; 7.2 Dissimilar L1-L2 Management; 7.3 Combined LRU/MRU Replacement Policy; 7.4 N Loops Look-ahead)
Data prefetching using hardware register value predictable table.
by Chin-Ming Cheung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 95-97).

Contents:
Abstract
Acknowledgement
1 Introduction (1.1 Overview; 1.2 Objective; 1.3 Organization of the Dissertation)
2 Related Works
  2.1 Previous Cache Works
  2.2 Data Prefetching Techniques (2.2.1 Hardware vs. Software Assisted; 2.2.2 Non-selective vs. Highly Selective; 2.2.3 Summary of Previous Data Prefetching Schemes)
3 Program Data Mapping
  3.1 Regular and Irregular Data Access
  3.2 Propagation of Data Access Regularity (3.2.1 Data Access Regularity in High-Level Programs; 3.2.2 Data Access Regularity in Machine Code; 3.2.3 Data Access Regularity in Memory Address Sequences; 3.2.4 Implication)
4 Register Value Prediction Table (RVPT)
  4.1 Predictability of Register Values
  4.2 Register Value Prediction Table
  4.3 Control Scheme of RVPT (4.3.1 Details of RVPT Mechanism; 4.3.2 Explanation of the Register Prediction Mechanism)
  4.4 Examples of RVPT (4.4.1 Linear Array Example; 4.4.2 Linked List Example)
5 Program Register Dependency
  5.1 Register Dependency
  5.2 Generalized Concept of Register (5.2.1 Cyclic Dependent Register (CDR); 5.2.2 Acyclic Dependent Register (ADR))
  5.3 Program Register Overview
6 Generalized RVPT Model
  6.1 Level N RVPT Model (6.1.1 Identification of Level N CDR; 6.1.2 Recording CDR Instructions of Level N CDR; 6.1.3 Prediction of Level N CDR)
  6.2 Level 2 Register Value Prediction Table (6.2.1 Level 2 RVPT Structure; 6.2.2 Identification of Level 2 CDR; 6.2.3 Control Scheme of Level 2 RVPT; 6.2.4 Example of Index Array)
7 Performance Evaluation
  7.1 Evaluation Methodology (7.1.1 Trace-Driven Simulation; 7.1.2 Architectural Method; 7.1.3 Benchmarks and Metrics)
  7.2 General Results (7.2.1 Constant-Stride or Regular Data Access Applications; 7.2.2 Non-constant-Stride or Irregular Data Access Applications)
  7.3 Effect of Design Variations (7.3.1 Effect of Cache Size; 7.3.2 Effect of Block Size; 7.3.3 Effect of Set Associativity)
  7.4 Summary
8 Conclusion and Future Research (8.1 Conclusion; 8.2 Future Research)
Bibliography
Appendices:
A MCPI vs. cache size
B MCPI reduction percentage vs. cache size
C MCPI vs. block size
D MCPI reduction percentage vs. block size
E MCPI vs. set associativity
F MCPI reduction percentage vs. set associativity
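The RVPT predicts the next values of registers used to form load addresses, so data can be prefetched before the load executes. As a rough, generic illustration of the underlying idea (the thesis's table layout and control scheme differ), the following tracks a per-load stride, which covers the linear-array case from Chapter 4:

```python
class StridePrefetcher:
    """Per-instruction stride predictor: a minimal stand-in for a register
    value prediction table, keyed by the load instruction's PC."""
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confidence)

    def access(self, pc, addr):
        """Record a load at `pc` touching `addr`; return an address to
        prefetch, or None if the pattern is not yet established."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, 0)
            return None
        last, stride, conf = self.table[pc]
        new_stride = addr - last
        if new_stride == stride:
            conf = min(conf + 1, 3)   # saturating confidence counter
        else:
            conf = 0                  # pattern broken: re-learn the stride
        self.table[pc] = (addr, new_stride, conf)
        # only prefetch once the same stride has been seen twice in a row
        return addr + new_stride if conf >= 1 else None
```

A plain stride table cannot handle the thesis's linked-list example, where the next address comes from loaded data rather than from a constant increment; that is the case the register-value-prediction machinery is designed to reach beyond.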
Techniques for advancing value prediction
Sequential performance is still an issue in computing. While some prediction mechanisms such as branch prediction and prefetching have been widely adopted in modern, general-purpose microprocessors, others such as value prediction have not been accepted due to their high area and misprediction overheads. True data dependences form a major bottleneck in sequential performance, and value prediction can be employed to speculatively resolve these dependences. Accurate predictors [1] [2] have been shown to provide performance benefits, albeit requiring large predictor state. We argue that a first step in making value prediction practical is to manage the metadata associated with the predictor effectively. Inspired by irregular prefetchers that store their metadata in off-chip memory, we propose adapting such a mechanism for value prediction, which provides not only performance benefits but also a means to off-load predictor state to the memory hierarchy. We show an average IPC improvement of 5.3% across a set of Qualcomm-provided traces [3].
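The metadata-management idea can be illustrated with a toy last-value predictor whose on-chip table spills evicted entries to a backing store instead of discarding them. This is a loose sketch of off-loading predictor state to the memory hierarchy, not the paper's actual predictor, which is far more elaborate and whose spills incur real memory latency.

```python
from collections import OrderedDict

class LastValuePredictor:
    """Toy last-value predictor with a small on-chip table. Entries evicted
    from the table spill to a backing store rather than being discarded,
    loosely mirroring off-loading predictor metadata to the memory
    hierarchy. Illustrative only."""
    def __init__(self, on_chip_entries=2):
        self.on_chip = OrderedDict()   # pc -> last produced value, LRU order
        self.backing = {}              # stand-in for metadata kept in memory
        self.capacity = on_chip_entries

    def predict(self, pc):
        if pc in self.on_chip:
            self.on_chip.move_to_end(pc)       # refresh LRU position
            return self.on_chip[pc]
        if pc in self.backing:                 # slow path: reload spilled state
            self._install(pc, self.backing.pop(pc))
            return self.on_chip[pc]
        return None                            # no prediction available

    def train(self, pc, actual_value):
        """Update at commit with the value the instruction actually produced."""
        self._install(pc, actual_value)

    def _install(self, pc, value):
        self.on_chip[pc] = value
        self.on_chip.move_to_end(pc)
        if len(self.on_chip) > self.capacity:
            old_pc, old_value = self.on_chip.popitem(last=False)
            self.backing[old_pc] = old_value   # spill instead of discard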
The result of a static instruction can be predicted by mapping runtime context information to the value produced by the instruction. To that end, existing value predictors use either branch history contexts [2] or value history contexts [1] to make predictions. As long histories are needed to achieve high accuracy, these approaches slow down the training time of the predictor, negatively impacting coverage. We identify that branch and value histories each provide distinct advantages to a value predictor, and therefore combine them in a novel predictor design called the Relevant Context-based Predictor (RCP) that maintains high accuracy while improving training time. We show an average speedup of 38% over a baseline that performs no value prediction on the Qualcomm-provided traces, compared to 34% for the previous best.
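The idea of indexing a predictor with a combination of branch-history and value-history context can be sketched as follows. The hashing scheme, field widths, and class names here are invented for illustration and are not the RCP design.

```python
class ContextValuePredictor:
    """Toy predictor indexed by a hash of the instruction's PC, recent
    branch outcomes, and recently produced values. Illustrates combining
    branch- and value-history contexts; not the actual RCP design."""
    def __init__(self, branch_bits=8, history_len=2):
        self.table = {}                  # context hash -> predicted value
        self.branch_hist = 0             # global branch-outcome shift register
        self.mask = (1 << branch_bits) - 1
        self.value_hist = []             # last few committed values
        self.history_len = history_len

    def record_branch(self, taken):
        self.branch_hist = ((self.branch_hist << 1) | int(taken)) & self.mask

    def _context(self, pc):
        return hash((pc, self.branch_hist, tuple(self.value_hist)))

    def predict(self, pc):
        return self.table.get(self._context(pc))

    def train(self, pc, actual):
        """Called at commit with the value the instruction actually produced."""
        self.table[self._context(pc)] = actual
        self.value_hist = (self.value_hist + [actual])[-self.history_len:]
```

Once the context seen at prediction time has been trained before, the predictor starts returning values; longer histories raise accuracy but, as the abstract notes, lengthen training and hurt coverage.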
ADAM : a decentralized parallel computer architecture featuring fast thread and data migration and a uniform hardware abstraction
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 247-256). The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has made migration an impractical solution until now. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.
by Andrew "bunnie" Huang. Ph.D.
Exploiting tightly-coupled cores
As we move steadily through the multicore era, and the number of processing cores on each chip continues to rise, parallel computation becomes increasingly important. However, parallelising an application is often difficult because of dependencies between different regions of code which require cores to communicate. Communication is usually slow compared to computation, and so restricts the opportunities for profitable parallelisation. In this work, I explore the opportunities provided when communication between cores has a very low latency and low energy cost. I observe that there are many different ways in which multiple cores can be used to execute a program, allowing more parallelism to be exploited in more situations, and also providing energy savings in some cases. Individual cores can be made very simple and efficient because they do not need to exploit parallelism internally. The communication patterns between cores can be updated frequently to reflect the parallelism available at the time, allowing better utilisation than specialised hardware which is used infrequently.
In this dissertation I introduce Loki: a homogeneous, tiled architecture made up of many simple, tightly-coupled cores. I demonstrate the benefits in both performance and energy consumption which can be achieved with this arrangement, and observe that it is also likely to have lower design and validation costs and to be easier to optimise. I then determine exactly where the performance bottlenecks of the design are and where the energy is consumed, and look into some more advanced optimisations which can make parallelism even more profitable.
AMC: Advanced Multi-accelerator Controller
The rapid advancement, use of diverse architectural features, and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity for data-level parallelism on a chip. A generic FPGA-based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules accelerators. In a real environment, such HLS multi-accelerator systems do not achieve ideal performance due to memory bandwidth issues. Thus, a system demands a memory manager and a scheduler that improve performance by managing and scheduling the multi-accelerator's memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and an efficient scheduler in the HLS-based multi-accelerator environment, called the Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory-intensive accelerators and High Performance Computing (HPC) applications, and is implemented and tested on a Xilinx Virtex-5 ML505 evaluation FPGA board. The performance of the system is compared against microprocessor-based systems integrated with an operating system. Results show that the AMC-based HLS multi-accelerator system achieves speedups of 10.4x and 7x over the MicroBlaze- and Intel Core-based HLS multi-accelerator systems, respectively.