    Unified on-chip multi-level cache management scheme using processor opcodes and addressing modes.

    by Stephen Siu-ming Wong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 164-170).

    Chapter 1: Introduction
        1.1 Cache Memory
        1.2 System Performance
        1.3 Cache Performance
        1.4 Cache Prefetching
        1.5 Organization of Dissertation
    Chapter 2: Related Work
        2.1 Memory Hierarchy
        2.2 Cache Memory Management
            2.2.1 Configuration
            2.2.2 Replacement Algorithms
            2.2.3 Write Back Policies
            2.2.4 Cache Miss Types
            2.2.5 Prefetching
        2.3 Locality
            2.3.1 Spatial vs. Temporal
            2.3.2 Instruction Cache vs. Data Cache
        2.4 Why Not a Large L1 Cache?
            2.4.1 Critical Time Path
            2.4.2 Hardware Cost
        2.5 Trend to have L2 Cache On Chip
            2.5.1 Examples
            2.5.2 Dedicated L2 Bus
        2.6 Hardware Prefetch Algorithms
            2.6.1 One Block Look-ahead
            2.6.2 Chen's RPT & Similar Algorithms
        2.7 Software Based Prefetch Algorithms
            2.7.1 Prefetch Instruction
        2.8 Hybrid Prefetch Algorithms
            2.8.1 Stride CAM Prefetching
    Chapter 3: Simulator
        3.1 Multi-level Memory Hierarchy Simulator
            3.1.1 Multi-level Memory Support
            3.1.2 Non-blocking Cache
            3.1.3 Cycle-by-cycle Simulation
            3.1.4 Cache Prefetching Support
    Chapter 4: Proposed Algorithms
        4.1 SIRPA
            4.1.1 Rationale
            4.1.2 Architecture Model
        4.2 Line Concept
            4.2.1 Rationale
            4.2.2 Improvement Over "Pure" Algorithm
            4.2.3 Architectural Model
        4.3 Combined L1-L2 Cache Management
            4.3.1 Rationale
            4.3.2 Feasibility
        4.4 Combining SIRPA with Default Prefetch
            4.4.1 Rationale
            4.4.2 Improvement Over "Pure" Algorithm
            4.4.3 Architectural Model
    Chapter 5: Results
        5.1 Benchmarks Used
            5.1.1 SPEC92int and SPEC92fp
        5.2 Configurations Tested
            5.2.1 Prefetch Algorithms
            5.2.2 Cache Sizes
            5.2.3 Cache Block Sizes
            5.2.4 Cache Set Associativities
            5.2.5 Bus Width, Speed and Other Parameters
        5.3 Validity of Results
            5.3.1 Total Instructions and Cycles
            5.3.2 Total References to Caches
        5.4 Overall MCPI Comparison
            5.4.1 Cache Size Effect
            5.4.2 Cache Block Size Effect
            5.4.3 Set Associativity Effect
            5.4.4 Hardware Prefetch Algorithms
            5.4.5 Software Based Prefetch Algorithms
        5.5 L2 Cache & Main Memory MCPI Comparison
            5.5.1 Cache Size Effect
            5.5.2 Cache Block Size Effect
            5.5.3 Set Associativity Effect
    Chapter 6: Conclusion
    Chapter 7: Future Directions
        7.1 Prefetch Buffer
        7.2 Dissimilar L1-L2 Management
        7.3 Combined LRU/MRU Replacement Policy
        7.4 N Loops Look-ahead
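    Among the hardware prefetch algorithms surveyed in Chapter 2 is one-block look-ahead (OBL), a classic scheme in which a miss on block b also fetches block b+1. The thesis record itself contains no code, so the following is only a rough C sketch of sequential OBL over a toy direct-mapped cache; the cache size, function names, and tag layout are all illustrative assumptions.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Toy direct-mapped cache: one block tag per set, -1 means empty.
       Sizes and names are illustrative, not from the thesis. */
    #define NSETS 8
    static long cache[NSETS];

    static void cache_init(void) {
        for (int i = 0; i < NSETS; ++i) cache[i] = -1;
    }

    static void fill(long block) { cache[block % NSETS] = block; }

    /* Returns true on hit. On a miss, one-block look-ahead fetches
       the demanded block AND the next sequential block. */
    static bool access_block(long block) {
        bool hit = (cache[block % NSETS] == block);
        if (!hit) {
            fill(block);        /* demand fetch */
            fill(block + 1);    /* OBL prefetch of the next block */
        }
        return hit;
    }

    int main(void) {
        cache_init();
        bool first  = access_block(0);  /* cold miss: fills blocks 0 and 1 */
        bool second = access_block(1);  /* hit, thanks to the prefetch */
        assert(!first && second);
        return 0;
    }
    ```

    Sequential OBL exploits spatial locality cheaply, which is why it serves as the baseline against which stride-based schemes such as Chen's RPT are compared.
    
    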

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and have thus evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment that enables heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and the target hardware architecture, and evaluates the scalability of the proposed framework on a cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, with further results obtained from an incomplete development branch of the compiler. It is shown that the problems generally scale well on the LE1 architecture up to eight cores, beyond which the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores), with further improvements from compiler optimisations (x14 for bitonic with the same configuration).
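    The work-item coalescing mentioned in the abstract refers to a standard transformation in OpenCL-to-CPU compilation: the per-work-item kernel body is wrapped in a loop over the work-items of a group, so one CPU core executes the whole group sequentially instead of spawning a thread per work-item. The sketch below is a minimal hand-written C illustration of the transformed code; the kernel (a vector add), function names, and sizes are assumptions, not taken from the LE1 framework.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical OpenCL kernel body: c[gid] = a[gid] + b[gid].
       In OpenCL this body runs once per work-item; after work-item
       coalescing the compiler emits a loop over the work-items so a
       single core processes the entire work-group. */
    static void vec_add_coalesced(const int *a, const int *b, int *c,
                                  size_t group_size) {
        for (size_t wi = 0; wi < group_size; ++wi) {  /* coalescing loop */
            c[wi] = a[wi] + b[wi];                    /* original kernel body */
        }
    }

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        int b[4] = {10, 20, 30, 40};
        int c[4];
        vec_add_coalesced(a, b, c, 4);
        assert(c[0] == 11 && c[3] == 44);
        return 0;
    }
    ```

    Coalescing of this kind is what lets a small number of CPU cores stand in for the massively parallel NDRange execution model, and the resulting loop is also a natural target for the VLIW scheduler's instruction-level parallelism.
    
    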