66 research outputs found

    Cache performance of the SPEC92 benchmark suite


    Empirical study of parallel LRU simulation algorithms

    This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD and are implemented on the MasPar MP-2 architecture; the other two are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The other two SIMD algorithms are more complex, but have costs that are independent of the cache size. Both the second and third SIMD algorithms compute all stack distances; the second is completely general, whereas the third presumes, and takes advantage of, bounds on the range of reference tags. Both MIMD algorithms implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from executions of three SPEC benchmark programs.
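    For readers unfamiliar with stack distances: under LRU, a reference hits in a fully associative cache of C lines exactly when its stack distance (its depth in the LRU stack) is less than C, so one pass over a trace yields hit/miss counts for every cache size at once. A minimal serial sketch of the quantity the five parallel algorithms compute (not of any of their implementations):

```python
def stack_distances(trace):
    """Return the LRU stack distance of each reference (None on first touch)."""
    stack = []                    # most-recently-used block at index 0
    out = []
    for block in trace:
        if block in stack:
            d = stack.index(block)   # depth in the LRU stack = stack distance
            stack.pop(d)
            out.append(d)
        else:
            out.append(None)         # cold (compulsory) miss
        stack.insert(0, block)       # block becomes most recently used
    return out

def misses(trace, cache_lines):
    """Misses in a fully associative LRU cache: distance None or >= cache size."""
    return sum(1 for d in stack_distances(trace)
               if d is None or d >= cache_lines)

trace = ["a", "b", "c", "a", "b", "c", "a"]
print(stack_distances(trace))          # [None, None, None, 2, 2, 2, 2]
print(misses(trace, 3), misses(trace, 2))   # 3 7
```

    The serial version is inherently sequential (each reference's distance depends on the stack left by all earlier references), which is exactly what makes the parallelizations in the paper non-trivial.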

    Improving on-chip data cache using instruction register information.

    by Lau Siu Chung. Thesis (M.Phil.), Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 71-74). Contents:
    - Front matter: Abstract; Acknowledgment; List of Figures
    - Chapter 1: Introduction (hiding memory latency; organization of dissertation)
    - Chapter 2: Related Work (hardware-controlled cache prefetching; software-assisted cache prefetching)
    - Chapter 3: Data Prefetching (data reference patterns; embedded hints for next data references; the Instruction Opcode and Addressing Mode Prefetching (IAP) scheme: basic, enhanced, and combined; summary)
    - Chapter 4: Performance Evaluation (evaluation methodology: trace-driven simulation, caching models, benchmarks and metrics; general results: varying cache size, cache block size, and associativity; other performance metrics: accuracy of prefetch, partial hit delay, bus usage; zero-time prefetch; summary)
    - Chapter 5: Conclusion (summary of our research; future work)
    - Bibliography

    Influence of Input/output Operations on Processor Performance

    Nowadays, computers are frequently equipped with peripherals (e.g., digital cameras and high-speed network interfaces) that transfer large amounts of data to and from system memory using direct memory access (DMA) techniques. While they are communicating with system memory to send or receive data blocks, these peripherals prevent the processor from accessing system memory for significant periods of time. In this paper we study the negative effects that I/O operations from computer peripherals have on processor performance. With the help of SMPL, a set of routines for building discrete-event simulators, we have developed configurable software that simulates a computer's processor and main memory as well as the I/O scenarios in which the peripherals operate. This software has been used to analyze the performance of four different processors in four I/O scenarios: video capture, video capture and playback, high-speed networking, and serial transmission.
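    SMPL itself is a C library of discrete-event simulation routines; purely as an illustrative assumption, the bus-contention effect being measured can be sketched with a toy serialized-memory-bus model. The function name, parameters, and the periodic-burst DMA pattern below are invented for the example and are not from the paper:

```python
def cpu_slowdown(n_accesses, access_time, dma_period, dma_burst):
    """Toy model: a DMA burst of length `dma_burst` monopolizes the memory
    bus at the start of every `dma_period`; CPU accesses that arrive during
    a burst stall until it ends. Returns total time relative to the
    contention-free time. (Simplification: an access that straddles the
    start of the next burst is allowed to complete.)"""
    t = 0.0
    for _ in range(n_accesses):
        phase = t % dma_period
        if phase < dma_burst:        # bus currently held by the peripheral
            t += dma_burst - phase   # stall until the burst finishes
        t += access_time             # then perform the memory access
    return t / (n_accesses * access_time)

print(cpu_slowdown(1000, 1.0, 10.0, 2.0))   # > 1.0: DMA steals bus cycles
```

    Even this crude model exhibits the paper's qualitative point: the longer the peripheral holds the bus per period, the larger the processor's effective memory latency.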

    Unified on-chip multi-level cache management scheme using processor opcodes and addressing modes.

    by Stephen Siu-ming Wong. Thesis (M.Phil.), Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 164-170). Contents:
    - Chapter 1: Introduction (cache memory; system performance; cache performance; cache prefetching; organization of dissertation)
    - Chapter 2: Related Work (memory hierarchy; cache memory management: configuration, replacement algorithms, write-back policies, cache miss types, prefetching; locality: spatial vs. temporal, instruction cache vs. data cache; why not a large L1 cache: critical time path, hardware cost; trend to have L2 cache on chip: examples, dedicated L2 bus; hardware prefetch algorithms: one-block look-ahead, Chen's RPT and similar algorithms; software-based prefetch algorithm: prefetch instruction; hybrid prefetch algorithm: stride CAM prefetching)
    - Chapter 3: Simulator (multi-level memory hierarchy simulator: multi-level memory support, non-blocking cache, cycle-by-cycle simulation, cache prefetching support)
    - Chapter 4: Proposed Algorithms (SIRPA: rationale, architecture model; line concept: rationale, improvement over "pure" algorithm, architectural model; combined L1-L2 cache management: rationale, feasibility; combining SIRPA with default prefetch: rationale, improvement over "pure" algorithm, architectural model)
    - Chapter 5: Results (benchmarks used: SPEC92int and SPEC92fp; configurations tested: prefetch algorithms, cache sizes, cache block sizes, cache set associativities, bus width, speed, and other parameters; validity of results: total instructions and cycles, total references to caches; overall MCPI comparison: cache size effect, cache block size effect, set associativity effect, hardware prefetch algorithms, software-based prefetch algorithms; L2 cache and main memory MCPI comparison: cache size effect, cache block size effect, set associativity effect)
    - Chapter 6: Conclusion
    - Chapter 7: Future Directions (prefetch buffer; dissimilar L1-L2 management; combined LRU/MRU replacement policy; N loops look-ahead)

    Replacement and placement policies for prefetched lines.

    by Sze Siu Ching. Thesis (M.Phil.), Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 119-122). Contents:
    - Chapter 1: Introduction (overlapping computations with memory accesses; cache line replacement policies; the rest of this paper)
    - Chapter 2: A Brief Review of the IAP Scheme (embedded hints for next data references; instruction opcode and addressing mode prefetching; summary)
    - Chapter 3: Motivation
    - Chapter 4: Related Work (existing replacement algorithms; placement policies for cache lines; summary)
    - Chapter 5: Replacement and Placement Policies of Prefetched Lines (IZ cache line replacement policy in the IAP scheme: the instant zero scheme; priority pre-updating and victim cache: priority pre-updating, priority pre-updating for cache, victim cache for unreferenced prefetch lines; prefetch cache for IAP lines; summary)
    - Chapter 6: Performance Evaluation (methodology and metrics: trace-driven simulation, caching models, simulation models and performance metrics; simulation results: general results, IZ replacement policy, priority pre-updating with victim cache (PPUVC) in a cache with the IAP scheme and in a prefetch-on-miss cache, prefetch cache; summary)
    - Chapter 7: Architecture Without LOAD-AND-STORE Instructions
    - Chapter 8: Conclusion
    - Appendix A: CPI Due to Cache Misses (varying cache size, cache line size, and set associativity for the instant zero replacement policy, priority pre-updating with victim cache, and the prefetch cache)
    - Appendix B: Simulation Results of the IZ Replacement Policy (memory delay time reduction: varying cache size, cache line size, and set associativity)
    - Appendix C: Simulation Results of Priority Pre-Updating with Victim Cache (PPUVC in the IAP scheme; PPUVC in a cache with prefetch-on-miss only; memory delay time reduction)
    - Appendix D: Simulation Results of the Prefetch Cache (memory delay time reduction; results of the three replacement policies: varying cache size, cache line size, and set associativity)
    - Bibliography

    Spatial instruction scheduling for raw machines

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (leaves 89-91). Instruction scheduling on software-exposed architectures, such as Raw, must be performed in both time and space. The complexity and variance of application scheduling regions dictate that the space-time scheduling task be divided into phases. Unfortunately, the interaction of phases presents a phase-ordering problem. In this thesis, the structure of program scheduling regions is studied. The scheduling regions are shown to have characteristics too diverse for a single simple algorithm to cover. A new scheduling technique is proposed to cope with this diversity and minimize the phase-ordering problem. First, rather than maintaining exact mappings of instructions to time and space, the internal state of the scheduler maintains probabilities for different assignments of instructions to time and space resources. Second, a set of small scheduling heuristics cooperatively iterates over the probabilistic assignments many times in order to minimize the effects of phase ordering. A simple spatial instruction scheduler for Raw machines based on this technique is implemented and shown to outperform existing spatial scheduling systems on average. By Shane Michael Swenson. M.Eng.
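    The core idea in the abstract (keep probabilities over placements, and let heuristics repeatedly sharpen them instead of committing early) can be sketched, very loosely, as a mean-field-style relaxation. Everything below is an illustrative assumption — the toy dataflow graph, the communication/load cost, and the softmax update are invented for the example and are not the thesis's actual heuristics:

```python
import math
import random

random.seed(1)

# Toy dataflow graph: edges (producer, consumer) between instructions.
EDGES = [(0, 2), (1, 2), (2, 4), (3, 4), (4, 5)]
N_INSTR, N_TILES = 6, 2

# P[i][t]: current probability that instruction i is placed on tile t.
# Random initialization breaks the symmetry between tiles.
P = [[random.random() + 0.5 for _ in range(N_TILES)] for _ in range(N_INSTR)]
P = [[w / sum(row) for w in row] for row in P]

def expected_cost(i, t):
    """Expected cross-tile edges touching i if i sits on tile t,
    plus a mild load-balance penalty from the other instructions."""
    comm = sum(1.0 - P[j][t]
               for a, b in EDGES
               for j in ((b,) if a == i else (a,) if b == i else ()))
    load = sum(P[j][t] for j in range(N_INSTR) if j != i)
    return comm + 0.3 * load

for _ in range(60):                  # heuristics sweep the assignments many times
    for i in range(N_INSTR):
        w = [math.exp(-3.0 * expected_cost(i, t)) for t in range(N_TILES)]
        s = sum(w)
        P[i] = [x / s for x in w]    # sharpen toward cheaper tiles

# Only at the end are the probabilities collapsed into a concrete placement.
placement = [max(range(N_TILES), key=lambda t: P[i][t]) for i in range(N_INSTR)]
print(placement)
```

    The point of the sketch is the shape of the state, not the numbers: because no instruction is pinned to a tile until the final collapse, each sweep can revisit earlier decisions, which is one way to blunt a phase-ordering problem.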

    An Interpolative Analytical Cache Model with Application to Performance-Power Design Space Exploration

    Caches are known to consume up to half of all system power in embedded processors. Co-optimizing the performance and power of the cache subsystem is therefore an important step in the design of embedded systems, especially those employing application-specific instruction processors. In this project, we propose an analytical cache model that succinctly captures the miss performance of an application over the entire cache parameter space. Unlike exhaustive trace-driven simulation, our model requires that the program be simulated only once so that a few key characteristics can be obtained. Using these application-dependent characteristics, the model can span the entire cache parameter space, consisting of cache sizes, associativities, and cache block sizes. In our unified model, we are able to cater for direct-mapped, set-associative, and fully associative instruction, data, and unified caches. Validation against full trace-driven simulations shows that our model has a high degree of fidelity. Finally, we show how the model can be coupled with a power model for caches so that one can very quickly identify Pareto-optimal performance-power design points for rapid design space exploration. Singapore-MIT Alliance (SMA)
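    The "simulate once, then span the parameter space" idea can be illustrated for the simplest slice of that space — fully associative LRU caches of varying size — using a one-pass stack-distance histogram. The paper's model additionally covers associativity and block size via other application-dependent characteristics, which this sketch does not attempt:

```python
from collections import Counter

def lru_histogram(trace):
    """One pass over the trace: histogram of LRU stack distances,
    count of cold misses, and trace length."""
    stack, hist, cold = [], Counter(), 0
    for block in trace:
        if block in stack:
            hist[stack.index(block)] += 1   # depth in LRU stack
            stack.remove(block)
        else:
            cold += 1                       # first touch: compulsory miss
        stack.insert(0, block)
    return hist, cold, len(trace)

def miss_ratio(hist, cold, n, cache_lines):
    """Cold misses plus references whose stack distance >= cache size."""
    m = cold + sum(c for d, c in hist.items() if d >= cache_lines)
    return m / n

trace = ["a", "b", "c", "a", "b", "c", "a"]
hist, cold, n = lru_histogram(trace)
for size in (1, 2, 3, 4):
    print(size, miss_ratio(hist, cold, n, size))
# sizes 1 and 2 miss on every reference; sizes 3 and 4 miss only on the
# three cold misses (3/7), and the curve is flat beyond the working set
```

    One simulation pass, many design points: the histogram is the "key characteristic" here, and evaluating a new cache size is a summation rather than a re-simulation — the same economy the paper's model achieves over its much larger parameter space.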