
    Dynamic data shapers optimize performance in Dynamic Binary Optimization (DBO) environment

    Processor hardware has been architected with the assumption that most data access patterns are linearly spatial in nature. But most applications involve algorithms designed with optimal efficiency in mind, which results in non-spatial, multi-dimensional data access. Moreover, this data view or access pattern changes dynamically across program phases. The result is a mismatch between the processor hardware's view of data and the algorithmic view of data, leading to significant memory access bottlenecks. The mismatch is especially pronounced in applications involving large datasets, producing significantly increased latency and user response times. Previous attempts to tackle this problem targeted execution-time optimization alone. We present a dynamic technique, piggybacked on classical dynamic binary optimization (DBO), that shapes the data view differently for each program phase, reducing program execution time along with access energy. Our implementation rearranges non-adjacent data into a contiguous data view, using wrappers to replace irregular data access patterns with a spatially local data view. HDTrans, a runtime dynamic binary optimization framework, is used to perform runtime instrumentation and dynamic data optimization. The scheme not only reduces program execution time but also lowers energy use. Commonly used benchmarks from the SPEC 2006 suite were profiled to identify irregular data accesses in procedures that contributed heavily to overall execution time. Wrappers built to replace these accesses with spatially adjacent data led to a significant improvement in total execution time: on average, a 20% reduction in time along with a 5% reduction in energy.
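    As a concrete illustration of the wrapper idea above, here is a minimal C sketch (not the paper's HDTrans-based implementation) that gathers a strided column into a contiguous "shaped" view once per phase, so the hot loop runs at unit stride; the function names are invented for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Phase-specific shaped view: gather one column of an n x n row-major
 * matrix into a contiguous buffer, so a column-walking phase reads
 * consecutive cache lines instead of one element per line. */
static double *shape_column(const double *m, size_t n, size_t col) {
    double *view = malloc(n * sizeof *view);
    if (!view) return NULL;
    for (size_t row = 0; row < n; row++)
        view[row] = m[row * n + col];   /* the stride-n reads happen once */
    return view;
}

int main(void) {
    enum { N = 4 };
    double m[N * N];
    for (size_t i = 0; i < N * N; i++) m[i] = (double)i;

    double *col2 = shape_column(m, N, 2);  /* shaped, spatially local view */
    double sum = 0.0;
    for (size_t i = 0; i < N; i++) sum += col2[i];  /* hot loop: unit stride */
    printf("sum of column 2 = %g\n", sum);
    free(col2);
    return 0;
}
```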

    Effective Compile-Time Analysis for Data Prefetching In Java

    The memory hierarchy in modern architectures continues to be a major performance bottleneck. Many existing techniques for improving memory performance focus on Fortran and C programs, but memory latency is also a barrier to achieving high performance in object-oriented languages. Existing software techniques are inadequate for exposing optimization opportunities in object-oriented programs. One key problem is the use of high-level programming abstractions, which make analysis difficult. Another challenge is that programmers use a variety of data structures, including arrays and linked structures, so optimizations must work on a broad range of programs. We develop a new unified data-flow analysis for identifying accesses to arrays and linked structures, called recurrence analysis. Prior approaches that identify these access patterns are ad hoc, or treat arrays and linked structures independently. The data-flow analysis is intra- and inter-procedural, which is important for Java programs that use encapsulation to hide implementation details. We show…
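    To make the two recurrence classes concrete, below is a hedged C analogue (the paper targets Java, where a compiler would insert the equivalent hints): an affine array recurrence, where the address several iterations ahead is computable, and a pointer-chasing recurrence, where only one hop is visible. The sketch assumes GCC/Clang's __builtin_prefetch.

```c
#include <stddef.h>
#include <stdio.h>

struct node { int val; struct node *next; };

/* Array recurrence: the index advances by a constant per iteration,
 * so the address several iterations ahead is computable and can be
 * prefetched early. The distance (16 here) is a tuning knob. */
long sum_array(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16]);  /* hint only; safe past the end */
        s += a[i];
    }
    return s;
}

/* Linked-structure recurrence: p = p->next. Only one hop of the future
 * is visible, so prefetch the next node while working on the current. */
long sum_list(const struct node *p) {
    long s = 0;
    for (; p; p = p->next) {
        __builtin_prefetch(p->next);     /* prefetch of NULL is harmless */
        s += p->val;
    }
    return s;
}

int main(void) {
    int a[64] = {0};
    struct node n2 = {2, NULL}, n1 = {1, &n2};
    printf("%ld %ld\n", sum_array(a, 64), sum_list(&n1));
    return 0;
}
```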

    Improving on-chip data cache using instruction register information.

    by Lau Siu Chung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 71-74).
    Contents: Chapter 1, Introduction (hiding memory latency; organization of the dissertation); Chapter 2, Related Work (hardware-controlled and software-assisted cache prefetching); Chapter 3, Data Prefetching (data reference patterns; embedded hints for next data references; the Instruction opcode and Addressing mode Prefetching (IAP) scheme in basic, enhanced, and combined forms); Chapter 4, Performance Evaluation (trace-driven simulation methodology, caching models, benchmarks and metrics; results for varying cache size, block size, and associativity; prefetch accuracy, partial hit delay, bus usage, and zero-time prefetch); Chapter 5, Conclusion (summary of the research and future work); Bibliography.
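    As a rough illustration of the IAP idea named in the contents (prefetching driven by the opcode and addressing mode sitting in the instruction register), here is a toy C sketch; the instruction encoding, the sequential-walk guess, and all names are assumptions made for this example, not the thesis's actual design.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy decoded-instruction record; the layout is illustrative only. */
enum { OP_LOAD = 1, MODE_BASE_DISP = 2 };

typedef struct {
    uint8_t  opcode;   /* e.g. OP_LOAD */
    uint8_t  mode;     /* addressing mode seen in the instruction register */
    uint32_t base;     /* base register value */
    int32_t  disp;     /* displacement */
} insn_t;

static void prefetch(uint32_t addr) { printf("prefetch 0x%08x\n", addr); }

/* IAP-style hint: a load using base+displacement addressing suggests a
 * sequential walk, so prefetch the element after the one being loaded. */
static void iap_hint(const insn_t *ir) {
    if (ir->opcode == OP_LOAD && ir->mode == MODE_BASE_DISP)
        prefetch(ir->base + (uint32_t)ir->disp + (uint32_t)sizeof(uint32_t));
}

int main(void) {
    insn_t ir = { OP_LOAD, MODE_BASE_DISP, 0x1000, 8 };
    iap_hint(&ir);   /* prints: prefetch 0x0000100c */
    return 0;
}
```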

    A Survey of Techniques for Architecting TLBs

    The translation lookaside buffer (TLB) caches virtual-to-physical address translations and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent TLB management is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe this paper will be useful for chip designers, computer architects, and system engineers.
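    For readers new to the topic, a minimal C sketch of the structure being surveyed: a set-associative TLB lookup mapping a virtual page number to a physical frame. The geometry and the omission of a replacement policy are simplifications for illustration, not drawn from the paper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy geometry: 16 sets x 4 ways, 4 KiB pages. Real TLBs vary widely. */
#define SETS 16
#define WAYS 4
#define PAGE_SHIFT 12
#define PAGE_MASK ((1u << PAGE_SHIFT) - 1)

typedef struct { uint64_t vpn, pfn; bool valid; } tlbe_t;
static tlbe_t tlb[SETS][WAYS];

/* On a hit, translate va to a physical address. On a miss, the caller
 * would walk the page table and refill the set. */
static bool tlb_lookup(uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_SHIFT;
    tlbe_t *set = tlb[vpn % SETS];
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].vpn == vpn) {
            *pa = (set[w].pfn << PAGE_SHIFT) | (va & PAGE_MASK);
            return true;
        }
    }
    return false;   /* TLB miss: costly, hence the management techniques */
}

int main(void) {
    tlb[1][0] = (tlbe_t){ .vpn = 1, .pfn = 42, .valid = true };
    uint64_t pa;
    if (tlb_lookup(0x1234, &pa))   /* vpn 1, page offset 0x234 */
        printf("pa = 0x%llx\n", (unsigned long long)pa);  /* 0x2a234 */
    return 0;
}
```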

    Compiler driven memory system optimization using speculative execution

    Master's thesis (Master of Science).

    Software-assisted data prefetching algorithms.

    by Chi-sum, Ho. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 110-113).
    Contents: Chapter 1, Introduction (overview; cache memories; improving cache and system performance); Chapter 2, Related Work (cache performance; non-blocking caches; hardware and software-assisted cache prefetching; improving cache effectiveness; other latency-reducing techniques including register preloading, write policies, small specialized caches, and program transformation); Chapter 3, Stride CAM Prefetching (architectural model with compiler and hardware support; optimization issues including eliminating redundant prefetching, code motion, burst mode, stride CAM overflow, and the effects of loop optimizations; practicability, prefetch accuracy, stride CAM size, and software overhead); Chapter 4, Stride Register Prefetching (stride registers, compiler support, prefetch bits, operation details; practicability on the NASA7 benchmark programs; comparison between the stride CAM and stride register models); Chapter 5, Small Software-Driven Array Cache (cache pollution in MXM; architectural model and operation; effectiveness); Chapter 6, Conclusion and a future extension of the stride CAM model; Bibliography; Appendices (simulation results for the stride CAM model and the array cache, covering execution time, memory delay, overhead, and hit ratio, plus the NASA7 benchmark kernels BTRIX, CFFT2D, CHOLSKY, EMIT, GMTRY, MXM, and VPENTA).
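    The stride CAM of Chapter 3 is, in essence, a PC-indexed reference prediction table. A minimal C sketch of that general mechanism follows; the table size, confidence rule, and names are illustrative, not the thesis's exact design.

```c
#include <stdint.h>
#include <stdio.h>

/* PC-indexed stride table: remember each load's last address, and once
 * the same nonzero stride repeats, prefetch one stride ahead. */
#define ENTRIES 64

typedef struct { uint64_t pc, last; int64_t stride; int confident; } sentry_t;
static sentry_t tab[ENTRIES];

static void prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

static void on_load(uint64_t pc, uint64_t addr) {
    sentry_t *e = &tab[pc % ENTRIES];
    if (e->pc != pc) {                    /* new entry: just record */
        *e = (sentry_t){ .pc = pc, .last = addr };
        return;
    }
    int64_t s = (int64_t)(addr - e->last);
    e->confident = (s != 0 && s == e->stride);  /* same stride twice? */
    e->stride = s;
    e->last = addr;
    if (e->confident)
        prefetch(addr + (uint64_t)s);     /* stride confirmed: run ahead */
}

int main(void) {
    for (uint64_t a = 0x1000; a < 0x1040; a += 8)
        on_load(0x400123, a);             /* same load PC, stride 8 */
    return 0;
}
```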

    Instruction prefetching techniques for ultra low-power multicore architectures

    As the gap between processor and memory speeds increases, memory latency has become a critical bottleneck for computing performance, and designers have been working on techniques to hide it. Embedded processor design, however, typically targets low cost and low power consumption, so techniques that satisfy these constraints are preferred in embedded domains. While out-of-order execution, aggressive speculation, and complex branch prediction can help hide memory access latency in high-performance systems, they carry a heavy power budget and are not suitable for embedded systems. Prefetching is another popular method for hiding memory access latency and has been studied extensively for high-performance processors; for embedded processors with strict power requirements, the application of complex prefetching techniques is likewise greatly limited, so a low-power, low-energy solution is desired. In this work, we focus on instruction prefetching for ultra-low-power processing architectures and aim to reduce the energy overhead of prefetching by proposing a combination of simple, low-cost, and energy-efficient techniques. We study a wide range of applications, from cryptography to computer vision, and show that our proposed mechanisms effectively improve the hit rate of almost all of them to above 95%, achieving an average performance improvement of more than 2X. Moreover, by synthesizing our designs using state-of-the-art technologies, we show that the prefetchers increase the system's power consumption by less than 15% and total silicon area by less than 1%. Altogether, the proposed schemes achieve a total energy reduction of 1.9X, enabling significantly higher battery life.
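    As a sketch of the "simple, low-cost" flavor of instruction prefetching the abstract advocates, here is a minimal next-line prefetcher in C with an energy-saving prefetch-on-miss gate; the callbacks, line size, and overall structure are illustrative assumptions, not the paper's proposed design.

```c
#include <stdint.h>
#include <stdio.h>

/* Next-line instruction prefetcher: on an I-cache access, prefetch the
 * following line. Gating it to misses trades hit rate for energy. */
#define LINE_SHIFT 6   /* 64-byte lines */

typedef struct {
    int  (*access)(uint64_t line);    /* returns 1 on hit */
    void (*prefetch)(uint64_t line);
} icache_t;

static void fetch(icache_t *ic, uint64_t pc, int prefetch_on_miss_only) {
    uint64_t line = pc >> LINE_SHIFT;
    int hit = ic->access(line);
    if (!hit || !prefetch_on_miss_only)
        ic->prefetch(line + 1);       /* next sequential line */
}

/* Stub backends so the sketch runs standalone. */
static int  stub_access(uint64_t line)   { (void)line; return 0; }
static void stub_prefetch(uint64_t line) {
    printf("prefetch line %llu\n", (unsigned long long)line);
}

int main(void) {
    icache_t ic = { stub_access, stub_prefetch };
    fetch(&ic, 0x4000, 1);   /* miss -> prefetch line 257 */
    return 0;
}
```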