Reducing load latency through memory instruction characterization.

Abstract

Processor performance is directly impacted by the latency of the memory system. As processor core cycle times decrease, the disparity between the latency of an arithmetic instruction and the average latency of a load instruction will continue to increase. A wide-issue superscalar machine requires a memory system highly optimized for latency. This dissertation analyzes the patterns of data sharing between memory instructions and the address calculation chains leading up to each load instruction. The analysis of memory instruction data sharing patterns shows that the dynamic address stream can be broken into several independent streams. This observation is used to segment the first level of the memory hierarchy, including the memory disambiguation logic, into several independent partitions. A partitioned cache with eight partitions can be accessed in half the time of an equivalently sized unpartitioned cache. An aggressive processor implementing a partitioned first-level cache outperformed the same processor implementing an equivalently sized conventional cache by 4.5% on the SPECint00 benchmark suite. The analysis of address calculation chains demonstrates that a relatively small number of unique functions are used in the calculation of memory data addresses within an application. A method of dynamically identifying these functions and reproducing them in hardware is developed. This technique allows the results of complex address calculations to be generated independently of the program instruction stream and without executing the instructions involved in the calculation. A processor utilizing this scheme outperformed a processor implementing conventional address prediction by 5.5% on the SPECint00 benchmark suite.
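The two mechanisms named above, stream-based cache partitioning and hardware reproduction of address calculation chains, can be illustrated with short sketches. Both sketches below are illustrative C written under assumed parameters; they are not the dissertation's actual designs.

First, a minimal sketch of partition selection, assuming the eight-partition configuration evaluated in the abstract, 64-byte cache lines, and the low-order address bits above the line offset as the partition index (the latter two choices are assumptions):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PARTITIONS   8   /* eight-partition configuration from the abstract */
    #define LINE_OFFSET_BITS 6   /* assumed 64-byte cache lines */

    /* Select which independent partition (tag array, data array, and
     * memory disambiguation logic) services a given data address. */
    static unsigned select_partition(uint64_t addr)
    {
        return (unsigned)((addr >> LINE_OFFSET_BITS) & (NUM_PARTITIONS - 1));
    }

    int main(void)
    {
        /* Loads whose addresses map to different partitions can be
         * disambiguated and serviced independently, which is what
         * shortens the access path of each partition. */
        uint64_t addrs[] = { 0x1000, 0x1040, 0x2080 };
        for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
            printf("addr %#" PRIx64 " -> partition %u\n",
                   addrs[i], select_partition(addrs[i]));
        return 0;
    }

Second, a minimal sketch of an address calculation chain captured as a reusable function, assuming a pointer-chasing loop as the running example; the structure and function names here are hypothetical:

    #include <stdio.h>

    /* Hypothetical list node; the 'next' field drives the address chain. */
    struct node { long payload; struct node *next; };

    /* One step of the captured address function a' = load(a + offset of next).
     * Hardware that has identified this function can generate future load
     * addresses without executing the program's own address-calculation
     * instructions. */
    static struct node *addr_chain_step(struct node *cur)
    {
        return cur->next;
    }

    int main(void)
    {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };

        /* Replay the captured chain ahead of the "program" to produce the
         * load addresses it will need. */
        for (struct node *p = &a; p != NULL; p = addr_chain_step(p))
            printf("would issue load at address %p\n", (void *)p);
        return 0;
    }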

Ph.D. dissertation, Applied Sciences: Electrical Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/123938/2/3106150.pd