140 research outputs found

Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads

Author: Emmons C
Van Laer A
Wang W
Publication venue: International Symposium on Memory Systems
Publication date: 05/10/2015

With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors gets deeper, and the cache size gets bigger. However, the cacheline size has remained relatively constant over the last decade in mobile application processors. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh, by looking at inefficiencies in the cache hierarchy which tend to be exacerbated when increasing the cacheline size: false sharing and cacheline utilization. Firstly, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures, mapped onto the same cacheline, are accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been found in various places such as scientific workloads and real applications. We find that whilst increasing the cacheline size does increase false sharing, it is still negligible when compared to known cases of false sharing in scientific workloads, due to the limited level of thread-level parallelism in mobile workloads. Secondly, we look at cacheline utilization, which measures the number of bytes in a cacheline actually used by the processor. This effect has been investigated under various names for a multitude of server and desktop applications. As a low cacheline utilization implies that very little of each fetched cacheline was used by the processor, it wastes bandwidth and energy in moving data across the memory hierarchy. The energy cost associated with data movement is much higher than that of logic operations, increasing the need for cache efficiency, especially in the case of an energy-constrained platform like a mobile device. We find that the cacheline utilization of mobile workloads is low in general, and decreases as the cacheline size increases. When increasing the cacheline size from 64 bytes to 128 bytes, the number of misses is reduced by 10%-30%, depending on the workload. However, because of the low cacheline utilization, this more than doubles the amount of unused traffic to the L1 caches. Using cacheline utilization as a metric in this way illustrates an important point. If a change in cacheline size were assessed only on its local effects, it would appear to have only advantages, as the miss rate decreases. However, at system level, this change will increase the stress on the bus and increase the amount of wasted energy due to unused traffic. Using cacheline utilization as a metric underscores the need for system-level research when changing characteristics of the cache hierarchy.
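
The trade-off described in this abstract (fewer misses, but more unused bytes moved) can be illustrated with a small sketch. It is not the paper's methodology: it assumes an idealized cache that never evicts, so every miss is a cold miss, and the trace (4-byte reads with a 96-byte stride) and the name `line_stats` are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def line_stats(trace, line_size):
    """Count line fetches and used bytes, assuming a cache that never evicts."""
    touched = defaultdict(set)  # line index -> byte offsets actually read
    for addr, size in trace:
        for byte in range(addr, addr + size):
            touched[byte // line_size].add(byte % line_size)
    misses = len(touched)                 # one cold miss per distinct line
    fetched = misses * line_size          # bytes moved into the cache
    used = sum(len(offsets) for offsets in touched.values())
    return {
        "misses": misses,
        "fetched_bytes": fetched,
        "unused_bytes": fetched - used,
        "utilization": used / fetched,
    }

# Hypothetical trace: 32 reads of 4 bytes each, 96 bytes apart.
trace = [(96 * k, 4) for k in range(32)]

for line_size in (64, 128):
    print(line_size, line_stats(trace, line_size))
```

For this particular trace the 128-byte configuration takes fewer misses than the 64-byte one, yet it fetches more bytes in total and uses a smaller fraction of them, which is the system-level effect the abstract points to.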

UCL Discovery

An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing

Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007

Crossref

Exploring the value of supporting multiple DSM protocols in Hardware DSM Controllers

Author: Kuramkote Ravindra
Publication venue: University of Utah
Publication date: 01/01/1999

The performance of a hardware distributed shared memory (DSM) system is largely dependent on its architect's ability to reduce the number of remote memory misses that occur. Previous attempts to solve this problem have included measures such as supporting both the CC-NUMA and S-COMA architectures in the same machine and providing a programmable DSM controller that can emulate any DSM mechanism. In this paper we first present the design of a DSM controller that supports multiple DSM protocols in custom hardware, and allows the programmer or compiler to specify on a per-variable basis what protocol to use to keep that variable coherent. The simulated performance of this DSM controller compares favorably with that of conventional single-protocol custom hardware designs, often outperforming the conventional systems by a factor of two. To achieve these promising results, the multi-protocol DSM controller needed to support only two DSM architectures (CC-NUMA and S-COMA) and three coherency protocols (both release and sequentially consistent write invalidate and release consistent write update). This work demonstrates the value of supporting a degree of flexibility in one's DSM controller design and suggests what operations such a flexible DSM controller should support.

The University of Utah: J. Willard Marriott Digital Library

Adaptive Block Pinning for Multi-core Architectures

Author: Chaturvedi Nitin
Kumar Rakesh
Sudarshan T S B
Publication date: 01/01/2008

Edinburgh Research Explorer

A Survey of Techniques for Architecting TLBs

Author: Mittal Sparsh
Publication venue: 'Wiley'
Publication date: 01/01/2016

The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers.
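
As a rough illustration of what a TLB does, the following sketch models a tiny fully associative TLB with LRU replacement in front of a page table. It is not taken from the survey: the class name `TinyTLB`, the 4 KiB page size, the LRU policy, and the dictionary standing in for a page-table walk are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes

class TinyTLB:
    """Toy fully associative TLB with LRU replacement (capacity >= 1)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # virtual page number -> physical frame
        self.entries = OrderedDict()      # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            # The dictionary lookup stands in for a costly page-table walk.
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset

tlb = TinyTLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
for vaddr in (0x10, 4096 + 5, 0x20, 2 * 4096, 4096):
    print(hex(vaddr), "->", hex(tlb.translate(vaddr)))
print("hits:", tlb.hits, "misses:", tlb.misses)
```

With only two entries and three pages in use, most of these accesses miss, which hints at why TLB reach and replacement policy are among the design dimensions such a survey covers.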

Research Archive of Indian Institute of Technology Hyderabad

Effective Compile-Time Analysis for Data Prefetching In Java

Author: Brendon D. Cahoon
Publication date: 01/01/2002

The memory hierarchy in modern architectures continues to be a major performance bottleneck. Many existing techniques for improving memory performance focus on Fortran and C programs, but memory latency is also a barrier to achieving high performance in object-oriented languages. Existing software techniques are inadequate for exposing optimization opportunities in object-oriented programs. One key problem is the use of high-level programming abstractions which make analysis difficult. Another challenge is that programmers use a variety of data structures, including arrays and linked structures, so optimizations must work on a broad range of programs. We develop a new unified data-flow analysis for identifying accesses to arrays and linked structures called recurrence analysis. Prior approaches that identify these access patterns are ad hoc, or treat arrays and linked structures independently. The data-flow analysis is intra- and inter-procedural, which is important in Java programs that use encapsulation to hide implementation details. We sho

CiteSeerX

ScholarWorks@UMass Amherst

Reducing Waste in Memory Hierarchies

Author: Tian Yingying
Publication date: 21/09/2015

Memory hierarchies play an important role in microarchitectural design to bridge the performance gap between modern microprocessors and main memory. However, memory hierarchies are inefficient because they store waste. This dissertation quantifies two types of waste, dead blocks and data redundancy. It studies waste in diverse memory hierarchies and proposes techniques that reduce waste to improve performance with limited overhead. This dissertation observes that the waste of dead blocks in an inclusive last-level cache consists of two kinds of blocks: blocks that are highly accessed in core caches and blocks that have low temporal locality in both core caches and the last-level cache. Blindly replacing all dead blocks in an inclusive last-level cache may degrade performance. This dissertation proposes temporal-based multilevel correlating cache replacement to improve the performance of inclusive cache hierarchies. This dissertation observes that waste exists in private caches of graphics processing units (GPUs) as zero-reuse blocks. It defines zero-reuse blocks as blocks that are dead after being inserted into caches. This dissertation proposes an adaptive GPU cache bypassing technique that improves performance and reduces power consumption by dynamically bypassing zero-reuse blocks. This dissertation exploits the waste of data redundancy at block-level granularity and finds that conventional cache design wastes capacity because it stores duplicate data. It quantifies the percentage of data duplication and analyzes its causes. This dissertation proposes a practical cache deduplication technique to increase the effectiveness of the cache with limited area and power consumption.
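
The idea of block-level deduplication mentioned in this abstract can be sketched as a store in which several addresses share one copy of identical block contents. This is only a toy model under stated assumptions, not the dissertation's design: the class `DedupStore`, its methods, and the use of the block contents themselves as the lookup key (rather than a hardware hash) are hypothetical.

```python
class DedupStore:
    """Toy block store in which identical block contents are kept once."""

    def __init__(self):
        self.tags = {}      # block address -> block contents (stands in for a pointer)
        self.refcount = {}  # block contents -> number of addresses sharing them

    def insert(self, addr, block):
        self.evict(addr)    # drop any older contents held for this address
        self.tags[addr] = block
        self.refcount[block] = self.refcount.get(block, 0) + 1

    def evict(self, addr):
        block = self.tags.pop(addr, None)
        if block is None:
            return
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]  # last sharer gone, free the data copy

    def duplication(self):
        """Fraction of resident blocks that are redundant copies."""
        if not self.tags:
            return 0.0
        return 1 - len(self.refcount) / len(self.tags)

store = DedupStore()
zero = bytes(64)            # an all-zero 64-byte block
ones = b"\x01" * 64
for addr, block in ((0, zero), (64, zero), (128, ones), (192, zero)):
    store.insert(addr, block)
print("distinct data copies:", len(store.refcount))
print("duplication:", store.duplication())
```

Here four resident blocks need only two distinct data copies, so half of the conventional data capacity would have been spent on duplicates in this made-up example.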

Texas A&M Repository

Study of the effect of optimizations on OpenCL code when executed on different heterogeneous architectures

Author: Θεοχαρίδης Κωνσταντίνος Χ.
Καλογήρου Χρήστος Κ.
Publication date: 01/01/2013

University of Thessaly Institutional Repository

High-level real-time programming in Java

Author: Christoph M. Kirsch
Daniel Spoonhower
David F. Bacon
David Grove
Eran Yahav
Martin T. Vechev
Matthias Hauswirth
Michael Hind
Perry Cheng
V. T. Rajan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2005

Real-time systems have reached a level of complexity beyond the scaling capability of the low-level or restricted languages traditionally used for real-time programming. While Metronome garbage collection has made it practical to use Java to implement real-time systems, many challenges remain for the construction of complex real-time systems, some specific to the use of Java and others simply due to the change in scale of such systems. The goal of our research is the creation of a comprehensive Java-based programming environment and methodology for the creation of complex real-time systems. Our goals include construction of a provably correct real-time garbage collector capable of providing worst case latencies of 100 μs, capable of scaling from sensor nodes up to large multiprocessors; specialized programming constructs that retain the safety and simplicity of Java, and yet provide sub-microsecond latencies; the extension of Java's "write once, run anywhere" principle from functional correctness to timing behavior; on-line analysis and visualization that aids in the understanding of complex behaviors; and a principled probabilistic analysis methodology for bounding the behavior of the resulting systems. While much remains to be done, this paper describes the progress we have made towards these goals.

CiteSeerX

Crossref