
    Cooperative cache scrubbing

    Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead: the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero-initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
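    The interface can be pictured as a pair of per-line hints issued by the memory manager. The C sketch below is only an illustration of that idea: `cache_scrub_line` and `cache_zero_line` are hypothetical intrinsics standing in for the paper's proposed instructions, and the 64-byte line size is an assumption.

```c
/* Sketch: how a memory manager might issue scrub hints after a GC
 * sweep. The intrinsics below are hypothetical stand-ins for the
 * paper's proposed instructions; real hardware would need ISA and
 * coherence support as the paper describes. */
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64  /* assumed cache-line size in bytes */

/* Hypothetical intrinsics: mark a line dead (drop it, never write
 * back) or install a zero line without fetching it from DRAM. */
static inline void cache_scrub_line(void *addr) { (void)addr; /* e.g. inline asm */ }
static inline void cache_zero_line(void *addr)  { (void)addr; /* e.g. inline asm */ }

/* Called by the collector on each reclaimed region: its contents are
 * dead, so dirty lines need never be written back to DRAM. */
void scrub_dead_region(void *start, size_t bytes) {
    uintptr_t p   = (uintptr_t)start & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)start + bytes;
    for (; p < end; p += LINE_SIZE)
        cache_scrub_line((void *)p);
}

/* Called before handing out a fresh allocation region: the program
 * will only ever see zeros, so avoid fetching stale data from DRAM. */
void zero_alloc_region(void *start, size_t bytes) {
    uintptr_t p   = (uintptr_t)start & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)start + bytes;
    for (; p < end; p += LINE_SIZE)
        cache_zero_line((void *)p);
}
```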

    Scalable cooperative caching algorithm based on bloom filters

    This thesis presents the design, implementation, and evaluation of a novel cooperative caching algorithm based on the Bloom filter data structure. The new algorithm uses a decentralized approach to resolve the problems that prevent existing solutions from being scalable: an overloaded manager, communication overhead among clients, and memory overhead for the global cache. The new solution reduces the manager load and the communication overhead by distributing the global cache information among cooperating clients, so the manager no longer maintains the global cache. Furthermore, the memory overhead is decreased by the Bloom filter data structure, which saves memory space in the global cache and makes the new algorithm scalable. The correctness of the research hypothesis is verified by running experiments on the caching algorithms. The results demonstrate that the new caching algorithm maintains block access times as low as those of existing algorithms. In addition, the new algorithm decreases the manager load by a factor of nine, and the communication overhead is reduced by nearly a factor of six as a result of distributing the global cache to clients. Finally, the results show a significant reduction in the memory overhead, which also contributes to the scalability of the new algorithm.
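    The underlying data structure is standard enough to sketch. Below is a minimal Bloom filter over block IDs of the kind each client could publish to summarize its local cache; the filter size, hash count, and hash function are illustrative choices, not parameters taken from the thesis.

```c
/* Minimal Bloom filter over 64-bit block IDs. A "yes" answer means
 * "probably cached" (false positives possible, false negatives not),
 * which is acceptable for cooperative caching: a wrong "yes" just
 * falls through to the server. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BF_BITS   (1u << 20)   /* filter size in bits (assumed) */
#define BF_HASHES 4            /* number of hash functions (assumed) */

typedef struct { uint8_t bits[BF_BITS / 8]; } bloom_t;

void bloom_init(bloom_t *bf) { memset(bf->bits, 0, sizeof bf->bits); }

static uint32_t hash_mix(uint64_t key, uint32_t i) {
    /* Double hashing: h1 + i*h2, both derived from one 64-bit mix. */
    key ^= key >> 33; key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33; key *= 0xc4ceb9fe1a85ec53ULL;
    key ^= key >> 33;
    uint32_t h1 = (uint32_t)key, h2 = (uint32_t)(key >> 32) | 1u;
    return (h1 + i * h2) % BF_BITS;
}

void bloom_add(bloom_t *bf, uint64_t block_id) {
    for (uint32_t i = 0; i < BF_HASHES; i++) {
        uint32_t b = hash_mix(block_id, i);
        bf->bits[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

bool bloom_maybe_has(const bloom_t *bf, uint64_t block_id) {
    for (uint32_t i = 0; i < BF_HASHES; i++) {
        uint32_t b = hash_mix(block_id, i);
        if (!(bf->bits[b / 8] & (1u << (b % 8)))) return false;
    }
    return true;
}
```

    Distributing such filters instead of exact block lists is what trades a small false-positive rate for the memory and communication savings the thesis reports.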

    Application-Specific Memory Subsystems

    The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. These cache hierarchies are designed to be general-purpose in that they strive to provide the best possible performance across a wide range of applications. However, such a memory subsystem does not necessarily provide the best possible performance for a particular application. Although general-purpose memory subsystems are desirable when the workload is unknown and the memory subsystem must remain fixed, when this is not the case a custom memory subsystem may be beneficial. For example, in an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) designed to run a particular application, a custom memory subsystem optimized for that application would be desirable. In addition, when there are tunable parameters in the memory subsystem, it may make sense to change these parameters depending on the application being run. Such a situation arises today with FPGAs and, to a lesser extent, GPUs, and it is plausible that general-purpose computers will begin to support greater flexibility in the memory subsystem in the future. In this dissertation, we first show that it is possible to create application-specific memory subsystems that provide much better performance than a general-purpose memory subsystem. In addition, we show a way to discover such memory subsystems automatically using a superoptimization technique on memory address traces gathered from applications. This allows one to generate a custom memory subsystem with little effort. We next show that our memory subsystem superoptimization technique can be used to optimize for objectives other than performance. As an example, we show that it is possible to reduce the number of writes to the main memory, which can be useful for main memories with limited write durability, such as flash or Phase-Change Memory (PCM). Finally, we show how to superoptimize memory subsystems for streaming applications, which are a class of parallel applications. In particular, we show that, through the use of ScalaPipe, we can author and deploy streaming applications targeting FPGAs with superoptimized memory subsystems. ScalaPipe is a domain-specific language (DSL) embedded in the Scala programming language for generating streaming applications that can be implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we are able to demonstrate actual performance improvements using the superoptimized memory subsystem with applications implemented in hardware.
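    The trace-driven search at the heart of this approach can be shown in miniature. The toy sketch below exhaustively sweeps a tiny design space (set count and line size of a direct-mapped cache) against a recorded address trace and keeps the configuration with the fewest misses; the dissertation's superoptimizer explores a far richer space (hierarchies, scratchpads, other objectives) with stochastic search, so this only illustrates the evaluation loop, and the trace shown is a stand-in.

```c
/* Toy version of trace-driven memory-subsystem search: simulate
 * candidate cache configurations on an address trace, keep the best. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Count misses for one direct-mapped configuration. Tags are stored
 * as line+1 so that 0 can mean "empty" after calloc. */
static unsigned long simulate(const uint64_t *trace, size_t n,
                              unsigned sets, unsigned line_bytes) {
    uint64_t *tags = calloc(sets, sizeof *tags);
    unsigned long misses = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line = trace[i] / line_bytes;
        unsigned set  = (unsigned)(line % sets);
        if (tags[set] != line + 1) { misses++; tags[set] = line + 1; }
    }
    free(tags);
    return misses;
}

int main(void) {
    /* Stand-in trace; the real tool replays traces gathered from
     * the target application. */
    uint64_t trace[] = {0, 64, 0, 128, 4096, 0, 64, 4160, 128, 0};
    size_t n = sizeof trace / sizeof trace[0];

    unsigned best_sets = 0, best_line = 0;
    unsigned long best = (unsigned long)-1;
    for (unsigned sets = 16; sets <= 1024; sets *= 2)
        for (unsigned line = 16; line <= 128; line *= 2) {
            unsigned long m = simulate(trace, n, sets, line);
            if (m < best) { best = m; best_sets = sets; best_line = line; }
        }
    printf("best: %u sets x %u-byte lines, %lu misses\n",
           best_sets, best_line, best);
    return 0;
}
```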

    Software-assisted cache mechanisms for embedded systems

    Thesis (Ph.D.) by Prabhat Jain, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 120-135).
    Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms to improve cache performance and provide opportunities to improve data availability that can lead to more predictable cache performance. The first cache mechanism presented is an intelligent cache replacement policy that utilizes information about dead data and data that is very frequently used. This mechanism is analyzed theoretically to show that the number of misses using intelligent cache replacement is guaranteed to be no more than the number of misses using traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated. The second cache mechanism presented is that of cache partitioning, which exploits disjoint access sequences that do not overlap in the memory space. A theoretical result is proven that shows that modifying an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. A profit-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. This framework takes a compiled benchmark program and a set of program inputs and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed cache mechanisms have been evaluated using this framework by measuring cache miss rates and Instructions Per Clock (IPC) information. The results show that the proposed cache mechanisms show promise in improving cache performance and predictability with a modest increase in silicon area.
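    The replacement idea can be sketched as a victim-selection rule: evict invalid ways first, then anything software has flagged dead, and otherwise fall back to LRU, so victim choice can never be worse than plain LRU. The structure and field names below are illustrative, not the thesis's implementation.

```c
/* Sketch of dead-hint-aware victim selection within one cache set. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 8

typedef struct {
    uint64_t tag;
    uint32_t last_use;   /* timestamp for LRU ordering */
    bool     valid;
    bool     dead;       /* set by a software hint: no future reads */
} line_t;

/* Pick a victim way within one set. */
int choose_victim(line_t set[WAYS]) {
    int lru = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;        /* free way first */
        if (set[w].dead)   return w;        /* dead data is free to drop */
        if (set[w].last_use < set[lru].last_use) lru = w;
    }
    return lru;                             /* otherwise plain LRU */
}
```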

    Methods and Tools for Performance Measurement, Analysis, and Evaluation of the I/O Units of Computer Systems

    Studies show that the compute performance of processors is growing faster than the I/O performance of secondary storage. As a result, CPUs often cannot exploit their full compute potential because they are waiting for data from secondary storage. Avoiding these wait times requires performance analysis and optimization of the storage. I/O benchmarks are software tools for such performance analysis; this work identifies their problems and solves them. An approach is developed that enables realistic, comparable, and simple I/O benchmarking through the MPI-IO interface.
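    A minimal example of the kind of measurement such a benchmark performs: every rank writes one chunk through an MPI-IO collective call, and aggregate bandwidth is computed from barrier-to-barrier wall time. This is not the tool developed in the work; the file name and chunk size are arbitrary.

```c
/* Minimal MPI-IO write-bandwidth measurement (compile with mpicc,
 * run with mpirun). Each rank writes a disjoint 4 MiB region. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Offset chunk = 4 << 20;           /* 4 MiB per rank (assumed) */
    char *buf = malloc(chunk);
    for (MPI_Offset i = 0; i < chunk; i++) buf[i] = (char)i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);                /* start ranks together */
    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, rank * chunk, buf, (int)chunk,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_sync(fh);                          /* flush to storage */
    MPI_Barrier(MPI_COMM_WORLD);                /* wait for slowest rank */
    double t1 = MPI_Wtime();

    MPI_File_close(&fh);
    if (rank == 0)
        printf("aggregate write: %.1f MiB/s\n",
               (double)size * chunk / (1 << 20) / (t1 - t0));
    free(buf);
    MPI_Finalize();
    return 0;
}
```

    Timing between barriers and syncing before the stop timestamp matter for comparability: without them, a benchmark reports cache or buffering effects rather than storage performance, which is one of the pitfalls such work addresses.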

    Self-Correcting LRU Replacement Policies

    With wider associativity, the replacement algorithm becomes critical. Although LRU makes many good replacement decisions, the wide performance gap between OPT, the optimum off-line algorithm, and LRU suggests that LRU still makes too many mistakes. Self-correcting…
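    The abstract is cut off, but the LRU-versus-OPT gap it refers to is easy to reproduce: simulate both policies on the same reference string with a small fully associative cache, using Belady's farthest-future-use rule for OPT. The trace and capacity below are illustrative.

```c
/* Compare LRU against Belady's OPT on one reference string. */
#include <stdio.h>

#define CAP 3   /* cache capacity in blocks (assumed) */

static int lru_misses(const int *t, int n) {
    int cache[CAP], stamp[CAP], used = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int hit = -1;
        for (int j = 0; j < used; j++) if (cache[j] == t[i]) hit = j;
        if (hit >= 0) { stamp[hit] = i; continue; }   /* refresh recency */
        misses++;
        if (used < CAP) { cache[used] = t[i]; stamp[used++] = i; }
        else {
            int v = 0;   /* evict least recently used */
            for (int j = 1; j < used; j++) if (stamp[j] < stamp[v]) v = j;
            cache[v] = t[i]; stamp[v] = i;
        }
    }
    return misses;
}

static int opt_misses(const int *t, int n) {
    int cache[CAP], used = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int hit = -1;
        for (int j = 0; j < used; j++) if (cache[j] == t[i]) hit = j;
        if (hit >= 0) continue;
        misses++;
        if (used < CAP) { cache[used++] = t[i]; continue; }
        /* Evict the block reused farthest in the future (Belady). */
        int v = 0, vdist = -1;
        for (int j = 0; j < used; j++) {
            int d = n;  /* never reused */
            for (int k = i + 1; k < n; k++)
                if (t[k] == cache[j]) { d = k; break; }
            if (d > vdist) { vdist = d; v = j; }
        }
        cache[v] = t[i];
    }
    return misses;
}

int main(void) {
    int trace[] = {1,2,3,4,1,2,5,1,2,3,4,5};
    int n = sizeof trace / sizeof trace[0];
    printf("LRU misses: %d, OPT misses: %d\n",
           lru_misses(trace, n), opt_misses(trace, n));   /* 10 vs 7 */
    return 0;
}
```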