283 research outputs found

    Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses, HPCA

    Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst-case conflict behavior, no study has really analyzed the pathological behavior of such hashing functions that often result in performance slowdown. In this paper, we present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions: prime modulo and prime displacement that are resistant to pathological behavior and yet are able to eliminate the worst-case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow add operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory-intensive applications. For applications that have non-uniform cache accesses, both prime modulo and prime displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using multiple prime displacement hashing functions in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7%.
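
    The abstract above names prime modulo and prime displacement hashing without giving formulas. The following is a minimal sketch under stated assumptions: prime modulo is taken to be the block address modulo the largest prime not exceeding the number of sets, and prime displacement is taken to be (p * tag + traditional index) mod number of sets. The function names and the displacement constant `p=19` are hypothetical choices for illustration, not values taken from the paper.

```python
def largest_prime_at_most(n):
    """Return the largest prime <= n (trial division; fine for small n)."""
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True

    while n >= 2 and not is_prime(n):
        n -= 1
    return n


def traditional_index(block_addr, n_sets):
    """Conventional indexing: low-order bits of the block address."""
    return block_addr % n_sets


def prime_modulo_index(block_addr, n_sets):
    """Assumed form: block address modulo the largest prime <= n_sets.
    Sets numbered from that prime up to n_sets - 1 are never used,
    which is the fragmentation the abstract refers to."""
    return block_addr % largest_prime_at_most(n_sets)


def prime_displacement_index(block_addr, n_sets, p=19):
    """Assumed form: (p * tag + traditional index) mod n_sets.
    p = 19 is a hypothetical displacement constant."""
    tag = block_addr // n_sets
    return (p * tag + block_addr % n_sets) % n_sets


if __name__ == "__main__":
    n_sets = 16
    # Block addresses with a stride equal to the number of sets:
    # the worst case for conventional indexing.
    addrs = [i * n_sets for i in range(13)]
    print(sorted({traditional_index(a, n_sets) for a in addrs}))
    print(sorted({prime_modulo_index(a, n_sets) for a in addrs}))
    print(sorted({prime_displacement_index(a, n_sets) for a in addrs}))
```

    With a stride equal to the number of sets, conventional indexing sends every block to set 0, while both sketched functions spread the same 13 blocks over 13 different sets.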

    Randomized cache placement for eliminating conflicts

    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as from cache simulations of individual loop kernels to illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.
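
    To illustrate how a pseudorandom element in the index function can break up repetitive conflicts, here is a minimal sketch using a simple XOR fold of two address fields. This is a stand-in chosen for brevity, not necessarily the index function the paper evaluates; the name `xor_index` and the 4-bit index width are assumptions.

```python
def conventional_index(block_addr, index_bits):
    """Conventional indexing: the low-order index_bits of the block address."""
    return block_addr & ((1 << index_bits) - 1)


def xor_index(block_addr, index_bits):
    """XOR the low-order index field with the field directly above it.
    Within one value of the upper field this is a permutation of the sets,
    so sequential accesses still spread evenly."""
    mask = (1 << index_bits) - 1
    low = block_addr & mask
    high = (block_addr >> index_bits) & mask
    return low ^ high


if __name__ == "__main__":
    index_bits = 4  # 16 sets, a hypothetical size
    # Sixteen blocks separated by a power-of-two stride.
    addrs = [i << index_bits for i in range(16)]
    print(sorted({conventional_index(a, index_bits) for a in addrs}))
    print(sorted({xor_index(a, index_bits) for a in addrs}))
```

    Under conventional indexing the sixteen strided blocks all compete for one set; with the XOR fold each lands in its own set.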

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    Department of Computer Science and Engineering. As the performance and energy efficiency requirements of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques. In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings. Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies.
Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity). Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similarly to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.

    Exploring Alternate Cache Indexing Techniques

    Cache memory is a bridging component which covers the increasing gap between the speed of a processor and main memory. An excellent performance of the cache is crucial to improve system performance. Conflict misses are one of the critical reasons that limit the cache performance by mapping blocks to the same set which results in the eviction of many blocks. However, many blocks in the cache sets are not mapped, and thus the available space is not efficiently utilized. A direct way to reduce conflict misses is to increase associativity, but this comes with the cost of an increase in the hit time. Another way to reduce conflict misses is to change the cache-indexing scheme and distribute the accesses across all sets. This thesis focuses on the second way mentioned above and aims to evaluate the impact of the matrix-based indexing scheme on cache performance against the traditional modulus-based indexing scheme. A correlation between the proposed indexing scheme and different cache replacement policies is also observed. The matrix-based indexing scheme yields a geometric mean speedup of 1.2% for SPEC CPU2017 benchmarks for single-core simulations when applied to a direct-mapped last-level cache. In this case, improvements of 1.5% and 4% are observed for at least eighteen and seven SPEC CPU2017 applications, respectively. Also, it yields a 2% performance improvement over sixteen SPEC CPU2006 benchmarks. The new indexing scheme correlates well with multiperspective reuse prediction. It is observed that LRU benefits the machine learning benchmark with a performance improvement of 5.1%. For multicore simulations, the new indexing scheme does not improve performance significantly. However, this scheme also does not impact the application’s performance negatively.
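
    The abstract contrasts matrix-based indexing with modulus-based indexing but does not define the matrix. A common reading, assumed here, is that each index bit is the parity (XOR) of a selected subset of address bits, which amounts to multiplying the address bit vector by a binary matrix over GF(2). The row masks in `ROWS` below are hypothetical values picked so that the mapping is invertible on the upper field; they are not taken from the thesis.

```python
def modulus_index(block_addr, n_sets):
    """Traditional modulus-based indexing."""
    return block_addr % n_sets


def matrix_index(block_addr, rows):
    """Index bit j is the parity of the address bits selected by rows[j],
    i.e. one row of a binary matrix applied to the address over GF(2)."""
    index = 0
    for j, row in enumerate(rows):
        parity = bin(block_addr & row).count("1") & 1
        index |= parity << j
    return index


# Hypothetical 4x8 matrix, one bit mask per index bit (16 sets).
# Low nibble: identity, so consecutive blocks still map to distinct sets.
# High nibble: a triangular pattern, which keeps the mapping invertible.
ROWS = [0b0001_0001, 0b0011_0010, 0b0110_0100, 0b1100_1000]


if __name__ == "__main__":
    # Sixteen blocks with a stride of 16, all in set 0 under modulus indexing.
    addrs = [i * 16 for i in range(16)]
    print(sorted({modulus_index(a, 16) for a in addrs}))
    print(sorted({matrix_index(a, ROWS) for a in addrs}))
```

    In this sketch the strided blocks that collide in a single set under modulus indexing are distributed over all sixteen sets.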

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to efficiently use the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performance, that is, half of the cache lines are read-friendly, while the other half is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and performance/energy advantages of SLC. Our experiments show an improvement of 43% in the total number of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache.

    Architecting Secure Processor Caches

    Caches in modern processors enable fast access to data and help alleviate the performance overheads from slow access to DRAM main-memory. While sharing of cache resources between multiple cores, especially the last-level cache, boosts cache utilization and improves system performance, it has been shown to cause serious security vulnerabilities in the form of cache side-channel attacks. Different cores of a system can simultaneously run sensitive and malicious applications which can contend for the shared cache space. As a result, accesses of a sensitive application can influence the cache utilization and the execution time of a malicious application, introducing a side-channel of information leakage. Such cache interactions between a sensitive victim and a malicious spy have been shown to allow leakage of encryption keys, user-sensitive data such as files or browsing histories, confidential intellectual property such as machine-learning models, etc. Similarly, such cache interactions can also be used as a channel for covert communication between two colluding malicious applications, when direct communication via network ports is disabled. The focus of this thesis is to develop principled and practical mitigation for such cache side channel and covert channel attacks. To develop principled defenses, it is necessary to develop a deep understanding of attacks. So, first, this thesis investigates the capabilities of attackers and in the process develops a new cache covert channel attack called Streamline, which is considerably faster than current state-of-the-art attacks, with fewer requirements. With an asynchronous and flushless information transmission protocol, Streamline reaches bit-rates of more than 1 MB/s while being applicable to all ISAs and micro-architectures. This demonstrates the need for effective defenses against cache attacks across all platforms.
Second, this thesis develops new principled and practical defenses utilizing cache location randomization. Randomized caches obfuscate the mappings of addresses to cache locations to prevent malicious programs from inferring contention patterns on shared last-level caches with victim programs. However, successive defenses relying on randomization have been broken by recent attacks. To end the arms race in randomized caches, this thesis proposes a principled defense, MIRAGE, which provides the security of a fully-associative design in a practical manner for randomized caches. This eliminates set-conflicts and set-conflict based cache attacks in a future-proof manner. Third, this thesis explores cache-partitioning based defenses to eliminate all potential cache side channels through shared last-level caches. Such defenses map mistrusting applications to isolated cache partitions, thus preventing any information leakage across applications through cache state changes. However, existing solutions are not scalable or do not allow flexible usage of DRAM and cache resources. To address these problems, this thesis provides a scalable and flexible cache-isolation framework, Bespoke Cache Enclaves, supporting hundreds of partitions independent of memory utilization. This work enables practical adoption of cache-isolation defenses against cache side-channel attacks. Lastly, this thesis develops techniques to secure caches against exploitation in transient execution attacks. Attacks like Spectre and Meltdown exploit processor speculation to illegally access secrets and leak these out through cache covert channels, i.e., making transient changes to processor caches. This thesis enables CleanupSpec, one of the first defenses against such attacks, which reverses speculative modifications to caches on mis-speculations, to limit such transient information leakage via caches. This solution prevents caches from being exploited by attacks like Spectre with minimal overheads.
Overall, this thesis enables several techniques that provide principled yet practical security for processor caches against side channels and covert channels. These techniques can potentially enable the wide adoption of secure cache designs in future processors and support efforts to enable confidential computing in systems. Ph.D.

    Exploiting cache locality at run-time

    With the increasing gap between the speeds of the processor and memory system, memory access has become a major performance bottleneck in modern computer systems. Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of high-performance platforms. Improving the memory performance of parallel applications with dynamic memory-access patterns on SMP systems is a hard problem. The solution to this problem is critical to the successful use of SMP systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem. Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout oriented run-time technique to exploit the cache locality of parallel loops. Our technique has been implemented in a run-time system. Using simulation and measurement, we have shown that our run-time approach can achieve performance comparable to compiler optimizations for regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our approach was shown to significantly improve the memory performance of applications with dynamic memory-access patterns. Such applications are usually hard to optimize with static compiler optimizations. Several contributions are made in this dissertation. We present models to characterize the complexity and present a solution framework for optimizing cache locality. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout oriented run-time technique for locality optimization. We present efficient scheduling algorithms to trade off locality and load imbalance. We provide a detailed performance evaluation of the run-time technique.