1,470 research outputs found

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    Get PDF
    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs

    Ubiquitous Memory Introspection (Preliminary Manuscript)

    Get PDF
    Modern memory systems play a critical role in the performance ofapplications, but a detailed understanding of the application behaviorin the memory system is not trivial to attain. It requires timeconsuming simulations of the memory hierarchy using long traces, andoften using detailed modeling. It is increasingly possible to accesshardware performance counters to measure events in the memory system,but the measurements remain coarse grained, better suited forperformance summaries than providing instruction level feedback. Theavailability of a low cost, online, and accurate methodology forderiving fine-grained memory behavior profiles can prove extremelyuseful for runtime analysis and optimization of programs.This paper presents a new methodology for Ubiquitous MemoryIntrospection (UMI). It is an online and lightweight mini-simulationmethodology that focuses on simulating short memory access tracesrecorded from frequently executed code regions. The simulations arefast and can provide profiling results at varying granularities, downto that of a single instruction or address. UMI naturally complementsruntime optimizations techniques and enables new opportunities formemory specific optimizations.In this paper, we present a prototype implementation of a runtimesystem implementing UMI. The prototype is readily deployed oncommodity processors, requires no user intervention, and can operatewith stripped binaries and legacy software. The prototype operateswith an average runtime overhead of 20% but this slowdown is only 6%slower than a state of the art binary instrumentation tool. We used32 benchmarks, including the full suite of SPEC2000 benchmarks, forour evaluation. We show that the mini-simulation results accuratelyreflect the cache performance of two existing memory systems, anIntel Pentium~4 and an AMD Athlon MP (K7) processor. We alsodemonstrate that low level profiling information from the onlinesimulation can serve to identify high-miss rate load instructions with a77% rate of accuracy compared to full offline simulations thatrequired days to complete. The online profiling results are used atruntime to implement a simple software prefetching strategy thatachieves a speedup greater than 60% in the best case

    Avoiding Conflicts Dynamically in Direct Mapped Caches with Minimal Hardware Support

    Get PDF
    The memory system is often the weakest link in the performance of today\u27s computers. Cache design has received increasing attention in recent years as increases in CPU performance continues to outpace decreases in memory latency. Bershad et al. proposed a hardware modification called the Cache Miss Lookaside buffer which attempts to dynamically identify data which is conflicting in the cache and remap pages to avoid future conflicts. In a follow-up paper, Bershad et al. tried to modify this idea to work with standard hardware but had less success than with their dedicated hardware. In this thesis, we focus on a modification of these ideas, using less complicated hardware and focusing more on sampling policies. The hardware support is reduced to a buffer of recent cache misses and a cache miss counter. Because determination of remapping candidates is moved to software, sampling policies are studied to reduce overhead which will most likely fall on the OS. Our results show that sampling can be highly effective in identifying conflicts that should be remapped. Finally, we show that the theoretical performance of such a system can compare favorably with more costly higher associativity caches

    Software Grand Exposure: SGX Cache Attacks Are Practical

    Full text link
    Side-channel information leakage is a known limitation of SGX. Researchers have demonstrated that secret-dependent information can be extracted from enclave execution through page-fault access patterns. Consequently, various recent research efforts are actively seeking countermeasures to SGX side-channel attacks. It is widely assumed that SGX may be vulnerable to other side channels, such as cache access pattern monitoring, as well. However, prior to our work, the practicality and the extent of such information leakage was not studied. In this paper we demonstrate that cache-based attacks are indeed a serious threat to the confidentiality of SGX-protected programs. Our goal was to design an attack that is hard to mitigate using known defenses, and therefore we mount our attack without interrupting enclave execution. This approach has major technical challenges, since the existing cache monitoring techniques experience significant noise if the victim process is not interrupted. We designed and implemented novel attack techniques to reduce this noise by leveraging the capabilities of the privileged adversary. Our attacks are able to recover confidential information from SGX enclaves, which we illustrate in two example cases: extraction of an entire RSA-2048 key during RSA decryption, and detection of specific human genome sequences during genomic indexing. We show that our attacks are more effective than previous cache attacks and harder to mitigate than previous SGX side-channel attacks

    Automatic creation of tile size selection models using neural networks

    Get PDF
    2010 Spring.Includes bibliographic references (pages 54-59).Covers not scanned.Print version deaccessioned 2022.Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by hand-crafting tile size selection (TSS) models that characterize the performance of the tiled program as a function of tile sizes. The best tile sizes are selected by either directly using the TSS model or by using the TSS model together with an empirical search. Hand-crafting accurate TSS models is hard, and adapting them to different architecture/compiler, or even keeping them up-to-date with respect to the evolution of a single compiler is often just as hard. Instead of hand-crafting TSS models, can we automatically learn or create them? In this paper, we show that for a specific class of programs fairly accurate TSS models can be automatically created by using a combination of simple program features, synthetic kernels, and standard machine learning techniques. The automatic TSS model generation scheme can also be directly used for adapting the model and/or keeping it up-to-date. We evaluate our scheme on six different architecture-compiler combinations (chosen from three different architectures and four different compilers). The models learned by our method have consistently shown near-optimal performance (within 5% of the optimal on average) across the tested architecture-compiler combinations

    Support for Programming Models in Network-on-Chip-based Many-core Systems

    Get PDF
    • …
    corecore