6 research outputs found

    Dynamic data shapers optimize performance in Dynamic Binary Optimization (DBO) environment

    Get PDF
    Processor hardware has been architected with the assumption that most data access patterns are linearly spatial in nature. Most applications, however, involve algorithms designed for optimal efficiency, which results in non-spatial, multi-dimensional data access; moreover, this data view, or access pattern, changes dynamically across program phases. The result is a mismatch between the processor hardware's view of data and the algorithmic view of data, leading to significant memory access bottlenecks. The mismatch is especially pronounced in applications involving large datasets, where it significantly increases latency and user response times. Previous attempts to tackle this problem were primarily targeted at execution-time optimization. We present a dynamic technique, piggybacked on classical dynamic binary optimization (DBO), that shapes the data view differently for each program phase, reducing both program execution time and access energy. Our implementation rearranges non-adjacent data into a contiguous data view, using wrappers to replace irregular data access patterns with a spatially local one. HDTrans, a runtime dynamic binary optimization framework, is used to perform the runtime instrumentation and dynamic data optimization. Commonly used benchmarks from the SPEC 2006 suite were profiled to identify irregular data accesses in the procedures that contributed most heavily to overall execution time; wrappers built to replace these accesses with spatially adjacent data led to a significant improvement in total execution time. On average, a 20% reduction in execution time was achieved, along with a 5% reduction in energy.
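
    The wrapper idea in this abstract can be sketched in C: an irregular, strided access pattern (here, a matrix column in row-major storage) is gathered once into a contiguous buffer, and the phase's kernel then runs over that spatially local copy. The function names (shape_column, sum_column) and the gather-then-scan structure are illustrative assumptions, not the paper's HDTrans-based implementation, which rewrites accesses at runtime via binary instrumentation.

        #include <stdlib.h>

        /* Hypothetical data-shaping wrapper: a column of a row-major matrix
         * is a stride-`cols` (non-spatial) access pattern; copying it into a
         * contiguous buffer gives later passes a spatially local data view. */
        static double *shape_column(const double *m, size_t rows, size_t cols,
                                    size_t c)
        {
            double *view = malloc(rows * sizeof *view);
            for (size_t r = 0; r < rows; r++)
                view[r] = m[r * cols + c];      /* strided gather, done once */
            return view;
        }

        /* The phase kernel now touches memory sequentially. */
        static double sum_column(const double *view, size_t rows)
        {
            double s = 0.0;
            for (size_t r = 0; r < rows; r++)
                s += view[r];
            return s;
        }

        int main(void)
        {
            enum { R = 4, C = 3 };
            double m[R * C];
            for (size_t i = 0; i < R * C; i++)
                m[i] = (double)i;
            double *col = shape_column(m, R, C, 1);  /* reshape column 1 */
            double s = sum_column(col, R);           /* sequential pass  */
            free(col);
            return s == 22.0 ? 0 : 1;                /* 1 + 4 + 7 + 10   */
        }

    The trade-off is the one the abstract implies: the one-time gather cost must be amortized over enough accesses within the program phase for the contiguous view to pay off.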

    Tolerating First Level Memory Access Latency In High-Performance Systems

    No full text
    In order to improve performance, future parallel systems will continue to increase the processing power of each node. As node processors execute more instructions concurrently, however, they become more sensitive to first-level memory access latency. This paper presents a set of hardware and software techniques, collectively referred to as register preloading, to effectively tolerate long first-level memory access latency. The techniques include speculative execution, loop unrolling, dynamic memory disambiguation, and strip-mining. Results show that register preloading provides excellent tolerance to first-level memory access latencies of up to 16 cycles for a 4-issue node processor. INTRODUCTION: The objective of designing a high-performance system is to speed up the execution of application programs. An important approach to achieving this objective is to exploit program parallelism at both the instruction level and the multiprocessor level. For example, the Alliant FX/..
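
    A minimal sketch of register preloading on a simple array kernel, assuming C as the target language; the 4-way unroll factor and the variable names are illustrative, not taken from the paper. The loads are hoisted to the top of the unrolled body so their latency overlaps the independent multiplies, strip-mining bounds the number of live temporaries, and the restrict qualifiers stand in for the paper's dynamic memory disambiguation by asserting that the two streams do not alias.

        /* Strip-mined, 4-way unrolled loop: all loads issue first
         * (t0..t3 live in registers), then the dependent work runs
         * while those loads are still in flight. */
        void scale(float *restrict dst, const float *restrict src,
                   int n, float k)
        {
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                float t0 = src[i];        /* preload: four independent  */
                float t1 = src[i + 1];    /* loads issued back to back, */
                float t2 = src[i + 2];    /* hiding first-level memory  */
                float t3 = src[i + 3];    /* latency behind each other  */
                dst[i]     = t0 * k;
                dst[i + 1] = t1 * k;
                dst[i + 2] = t2 * k;
                dst[i + 3] = t3 * k;
            }
            for (; i < n; i++)            /* remainder strip */
                dst[i] = src[i] * k;
        }

    Speculative execution, the remaining technique in the paper's list, would let such preloads be hoisted even above branches; that is hardware and compiler machinery a source-level sketch cannot show.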

    Software-assisted data prefetching algorithms.

    Get PDF
    by Ho, Chi-sum. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 110-113). Contents:
    Chapter 1, Introduction: Overview; Cache Memories; Improving Cache Performance; Improving System Performance; Organization of the dissertation.
    Chapter 2, Related Work: Cache Performance; Non-Blocking Cache; Cache Prefetching (Hardware Prefetching; Software-assisted Prefetching; Improving Cache Effectiveness); Other Techniques to Reduce and Hide Memory Latencies (Register Preloading; Write Policies; Small Specialized Cache; Program Transformation).
    Chapter 3, Stride CAM Prefetching: Introduction; Architectural Model (Compiler Support; Hardware Support; Model Details); Optimization Issues (Eliminating Redundant Prefetching; Code Motion; Burst Mode; Stride CAM Overflow; Effects of Loop Optimizations); Practicability (Evaluation Methodology; Prefetch Accuracy; Stride CAM Size; Software Overhead).
    Chapter 4, Stride Register Prefetching: Motivation; Architectural Model (Stride Register; Compiler Support; Prefetch Bits; Operation Details); Practicability and Optimizations (Practicability on NASA7 Benchmark Programs; Optimization Issues); Comparison Between Stride CAM and Stride Register Models.
    Chapter 5, Small Software-Driven Array Cache: Introduction; Cache Pollution in MXM; Architectural Model (Operation Details); Effectiveness of Array Cache.
    Chapter 6, Conclusion: Conclusion; Future Research: An Extension of the Stride CAM Model (Background; Reference Address Series; Extending the Stride CAM Model; Prefetch Overhead).
    Bibliography. Appendix A, Simulation Results - Stride CAM Model: Execution Time, Memory Delay, Overhead, and Hit Ratio, each for BTRIX, CFFT2D, CHOLSKY, EMIT, GMTRY, MXM, and VPENTA. Appendix B, Simulation Results - Array Cache. Appendix C, NASA7 Benchmark: BTRIX; CFFT2D (cfft2d1, cfft2d2); CHOLSKY; EMIT; GMTRY; MXM; VPENTA.
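
    The software-assisted, constant-stride prefetching studied in chapters 3 and 4 can be approximated in portable C with GCC/Clang's __builtin_prefetch; the prefetch distance of 16 iterations is an assumed tuning value, not one taken from the thesis, whose stride CAM and stride register hardware computes such addresses automatically rather than in source code.

        /* Compiler-style software prefetching for constant-stride streams:
         * each iteration requests the line PF_DIST elements ahead, so the
         * miss latency is hidden behind PF_DIST iterations of real work. */
        #define PF_DIST 16   /* assumed distance; tune to the miss latency */

        double dot(const double *a, const double *b, int n)
        {
            double s = 0.0;
            for (int i = 0; i < n; i++) {
                /* Prefetch hints never fault, so requesting a few
                 * elements past the end of the arrays is harmless. */
                __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3);
                __builtin_prefetch(&b[i + PF_DIST], 0, 3);
                s += a[i] * b[i];
            }
            return s;
        }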

    Data prefetching using hardware register value predictable table.

    Get PDF
    by Cheung, Chin-Ming. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 95-97). Contents:
    Chapter 1, Introduction: Overview; Objective; Organization of the dissertation.
    Chapter 2, Related Works: Previous Cache Works; Data Prefetching Techniques (Hardware vs. Software Assisted; Non-selective vs. Highly Selective; Summary on Previous Data Prefetching Schemes).
    Chapter 3, Program Data Mapping: Regular and Irregular Data Access; Propagation of Data Access Regularity (Data Access Regularity in High Level Program; Data Access Regularity in Machine Code; Data Access Regularity in Memory Address Sequence; Implication).
    Chapter 4, Register Value Prediction Table (RVPT): Predictability of Register Values; Register Value Prediction Table; Control Scheme of RVPT (Details of RVPT Mechanism; Explanation of the Register Prediction Mechanism); Examples of RVPT (Linear Array Example; Linked List Example).
    Chapter 5, Program Register Dependency: Register Dependency; Generalized Concept of Register (Cyclic Dependent Register (CDR); Acyclic Dependent Register (ADR)); Program Register Overview.
    Chapter 6, Generalized RVPT Model: Level N RVPT Model (Identification of Level N CDR; Recording CDR Instructions of Level N CDR; Prediction of Level N CDR); Level 2 Register Value Prediction Table (Level 2 RVPT Structure; Identification of Level 2 CDR; Control Scheme of Level 2 RVPT; Example of Index Array).
    Chapter 7, Performance Evaluation: Evaluation Methodology (Trace-Driven Simulation; Architectural Method; Benchmarks and Metrics); General Result (Constant Stride or Regular Data Access Applications; Non-constant Stride or Irregular Data Access Applications); Effect of Design Variations (Effect of Cache Size; Effect of Block Size; Effect of Set Associativity); Summary.
    Chapter 8, Conclusion and Future Research: Conclusion; Future Research.
    Bibliography. Appendices A-F: MCPI and MCPI reduction percentage vs. cache size, block size, and set-associativity.
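
    The core RVPT mechanism, a table indexed by load PC that predicts the next effective address from the last observed stride, can be modeled in a few lines of C. The table size, the hash, and the seen-twice confidence rule below are assumptions for illustration; the thesis's generalized model additionally follows cyclic register dependences (e.g. linked-list pointer chasing), which this sketch omits.

        #include <stdint.h>

        #define RVPT_SIZE 64            /* assumed table size */

        struct rvpt_entry {
            uintptr_t pc;               /* load instruction address (tag)  */
            uintptr_t last_addr;        /* last effective address observed */
            intptr_t  stride;           /* difference of last two addrs    */
            int       confident;        /* same stride seen twice in a row */
        };

        static struct rvpt_entry rvpt[RVPT_SIZE];

        /* Called once per executed load; returns the predicted next
         * address to prefetch, or 0 when no confident prediction exists. */
        uintptr_t rvpt_observe(uintptr_t pc, uintptr_t addr)
        {
            struct rvpt_entry *e = &rvpt[(pc >> 2) % RVPT_SIZE];
            if (e->pc != pc) {                   /* cold or evicted entry */
                e->pc = pc;
                e->last_addr = addr;
                e->stride = 0;
                e->confident = 0;
                return 0;
            }
            intptr_t stride = (intptr_t)(addr - e->last_addr);
            e->confident = (stride != 0 && stride == e->stride);
            e->stride = stride;
            e->last_addr = addr;
            return e->confident ? addr + (uintptr_t)stride : 0;
        }

    For a load walking an array with constant stride, the entry becomes confident on its third execution and every later execution yields a prefetchable address; an irregular access keeps changing the stride and stays silent, which is the selectivity the thesis aims for.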