5,268 research outputs found

    A Cache Model for Modern Processors

    Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least recently used (LRU). Unfortunately, current cache models use stack distances to predict LRU or its variants, and cannot capture these high-performance policies. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. This model uses absolute reuse distances instead of stack distances, which makes it applicable to arbitrary age-based replacement policies. We thoroughly validate our model against several high-performance policies on synthetic and real benchmarks, where its median error is less than 1%. Finally, we present two case studies showing how to use the model to improve shared and single-stream cache performance.
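
    To make this concrete, the sketch below (an illustration only, not the paper's model) computes the absolute reuse distance histogram of an access trace: for each access, the number of accesses to any address since the previous access to the same address. An age-based model of the kind described above is driven by such a distribution rather than by an LRU stack-distance profile.

        from collections import defaultdict

        def reuse_distance_histogram(trace):
            """Histogram of absolute reuse distances; first references are
            counted separately as cold misses."""
            last_access = {}              # address -> index of its previous access
            histogram = defaultdict(int)  # reuse distance -> count
            cold_misses = 0
            for i, addr in enumerate(trace):
                if addr in last_access:
                    histogram[i - last_access[addr]] += 1
                else:
                    cold_misses += 1
                last_access[addr] = i
            return dict(histogram), cold_misses

        hist, cold = reuse_distance_histogram("abcabdaab")
        print(hist, cold)   # {3: 3, 1: 1, 4: 1} 4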

    Efficient Modeling of Random Sampling-Based LRU Cache

    The Miss Ratio Curve (MRC) is an important metric and effective tool for caching system performance prediction and optimization. Since the Least Recently Used (LRU) replacement policy is the de facto policy for many existing caching systems, most previous studies on efficient MRC construction focus on the LRU replacement policy. Recently, the random sampling-based replacement mechanism, as opposed to replacement relying on the rigid LRU data structure, has been gaining popularity due to its light weight and flexibility. To approximate LRU, at replacement time the system randomly selects K objects and replaces the least recently used object among the sample. Redis implements this approximated LRU policy. We observe that there can be a significant miss ratio gap between exact LRU and random sampling-based LRU under different sampling sizes K; therefore, existing LRU MRC construction techniques cannot be directly applied to random sampling-based LRU caches without loss of accuracy. In this thesis, we present a new probabilistic stack algorithm named KRR that accurately models random sampling-based LRU caches with arbitrary sampling size K. We propose two efficient stack update algorithms that reduce the expected running time of KRR from O(NM) to O(N log^2 M) and O(N log M), respectively, where N is the workload length and M is the number of distinct objects. Our implementation generates accurate miss ratio curves for both fixed and variable block size caches. Furthermore, we adopt spatial sampling, which further reduces the running time of KRR by several orders of magnitude and thus enables practical, low-overhead online application of KRR.
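
    For reference, here is a minimal sketch of the sampled-LRU mechanism the thesis models: on eviction, sample K resident keys and evict the least recently used one among them. Class and method names are illustrative; this is neither the KRR algorithm nor Redis's actual implementation.

        import random

        class SampledLRUCache:
            """Toy random sampling-based LRU cache with sample size k."""

            def __init__(self, capacity, k=5):
                self.capacity = capacity
                self.k = k
                self.store = {}       # key -> value
                self.last_used = {}   # key -> logical timestamp of last access
                self.clock = 0

            def _evict_one(self):
                sample = random.sample(list(self.store), min(self.k, len(self.store)))
                victim = min(sample, key=lambda key: self.last_used[key])
                del self.store[victim]
                del self.last_used[victim]

            def get(self, key):
                self.clock += 1
                if key in self.store:
                    self.last_used[key] = self.clock
                    return self.store[key]
                return None           # miss

            def put(self, key, value):
                self.clock += 1
                if key not in self.store and len(self.store) >= self.capacity:
                    self._evict_one()
                self.store[key] = value
                self.last_used[key] = self.clock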

    Optimal Eviction Policies for Stochastic Address Traces

    The eviction problem for memory hierarchies is studied for the Hidden Markov Reference Model (HMRM) of the memory trace, showing how miss minimization can be naturally formulated in the optimal control setting. In addition to the traditional version assuming a buffer of fixed capacity, a relaxed version is also considered, in which buffer occupancy can vary and its average is constrained. Resorting to multiobjective optimization, viewing occupancy as a cost rather than as a constraint, the optimal eviction policy is obtained by composing solutions for the individual addressable items. This approach is then specialized to the Least Recently Used Stack Model (LRUSM), a type of HMRM often considered for traces, which includes V-1 parameters, where V is the size of the virtual space. A gain-optimal policy for any target average occupancy is obtained which (i) is computable in time O(V) from the model parameters, (ii) is optimal also for the fixed capacity case, and (iii) is characterized in terms of priorities, under the name of Least Profit Rate (LPR) policy. An O(log C) upper bound (where C is the buffer capacity) is derived for the ratio between the expected miss rate of LPR and that of OPT, the optimal off-line policy; the upper bound is tightened to O(1) under reasonable constraints on the LRUSM parameters. Using the stack-distance framework, an algorithm is developed to compute the number of misses incurred by LPR on a given input trace, simultaneously for all buffer capacities, in time O(log V) per access. Finally, some results are provided for miss minimization over a finite horizon and over an infinite horizon under bias optimality, a criterion more stringent than gain optimality. Comment: 37 pages, 3 figures.
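
    The stack-distance framework mentioned above can be illustrated by the classic single-pass simulation below, which yields the LRU miss count for every buffer capacity at once. It is only an illustration of the framework: it takes O(V) per access rather than the O(log V) achieved in the paper, and it simulates LRU rather than the LPR policy.

        def misses_for_all_capacities(trace):
            """Mattson-style stack simulation: one pass over the trace gives the
            LRU miss count for every buffer capacity simultaneously."""
            stack = []             # most recently used item at position 0
            distance_counts = {}   # stack distance -> number of accesses
            cold_misses = 0
            for item in trace:
                if item in stack:
                    depth = stack.index(item) + 1   # 1-based stack distance
                    distance_counts[depth] = distance_counts.get(depth, 0) + 1
                    stack.remove(item)
                else:
                    cold_misses += 1
                stack.insert(0, item)

            def misses(capacity):
                # an access misses iff its stack distance exceeds the capacity
                return cold_misses + sum(n for d, n in distance_counts.items() if d > capacity)

            return misses

        misses = misses_for_all_capacities("abcabdaab")
        print([misses(c) for c in (1, 2, 3, 4)])   # [8, 8, 4, 4]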

    A study of paging algorithms.


    An accurate prefetching policy for object oriented systems

    PhD Thesis. In the latest high-performance computers, there is a growing requirement for accurate prefetching (AP) methodologies for advanced object management schemes in virtual memory and migration systems. The major issue in achieving this goal is finding a simple way of accurately predicting the objects that will be referenced in the near future and grouping them so that they can be fetched at the same time. The basic notion of AP is to build relationships that logically group related objects and to prefetch them, rather than relying on physical grouping and demand fetching as existing restructuring or grouping schemes do. In this way, AP tries to overcome some of the shortcomings of physical grouping methods. Prefetching also makes use of the properties of object-oriented languages to build inter- and intra-object relationships as a means of logical grouping. This thesis describes how these relationships can be established at compile time and how they can be used for accurate object prefetching in virtual memory systems. In addition, AP performs control-flow and data-dependency analysis to reinforce the relationships and to find the dependencies of a program. The user program is decomposed into prefetching blocks, which contain all the information needed for block prefetching, such as long branches and function calls at major branch points. The proposed prefetching scheme is implemented by extending a C++ compiler and evaluated on a virtual memory simulator. The results show a significant reduction both in the number of page faults and in memory pollution. In particular, AP can suppress many page faults that occur during transition phases, which are unmanageable by other fetching approaches. AP can be applied to local and distributed virtual memory systems to reduce the fault rate by fetching groups of objects at the same time, consequently lessening operating system overheads. British Council.
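
    As a rough illustration of the fetching side of this idea (hypothetical names and structures, not the thesis's implementation), a fault handler could consult a compile-time relationship map and fetch the faulting object together with its statically related group:

        # The relationship map stands in for the compile-time inter-/intra-object
        # analysis described above; its contents here are made up for illustration.
        relationships = {
            "order": ["customer", "invoice"],
            "customer": ["address"],
        }

        resident = set()   # objects currently in memory

        def fetch(obj):
            resident.add(obj)   # placeholder for the transfer from backing store

        def on_fault(obj):
            """Fetch the faulting object and its statically related group in one go."""
            group = [obj] + [o for o in relationships.get(obj, []) if o not in resident]
            for member in group:
                fetch(member)
            return group

        print(on_fault("order"))   # ['order', 'customer', 'invoice']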

    Working Sets Past and Present


    Study and optimization of the memory management in Memcached

    Over the years, the Internet has become more popular than ever, and web applications like Facebook and Twitter are gaining more users. This results in the generation of more and more user data that has to be managed efficiently, because access speed is a critical factor nowadays: a user will not wait more than three seconds for a web page to load before abandoning the site. In-memory key-value stores like Memcached and Redis are used to speed up web applications by speeding up access to the data and decreasing the number of accesses to the slower backing data storage. The first deployment of Memcached, on LiveJournal's website, showed that 28 Memcached instances on ten unique hosts, caching the most popular 30GB of data, can achieve a hit rate of around 92%, considerably reducing the number of database accesses and the response time. Not all cached objects take the same time to recompute, so this research studies and presents a new cost-aware memory management scheme that is easy to integrate into a key-value store, with the approach implemented in Memcached. The new memory management and cache give priority to key-value pairs that take longer to recompute. Instead of replacing Memcached's replacement structure and its policy, we simply add to each structure a new segment that is capable of storing the more costly key-value pairs. Apart from this new segment in each replacement structure, we created a new dynamic cost-aware rebalancing policy in Memcached, giving more memory to the storage of more costly key-value pairs. With these implementations, we offer a prototype that can be used to study the effect of recomputation cost on caching system performance. In addition, in certain scenarios we were able to improve the user's access latency and the total recomputation cost of the key-value pairs stored in the system.
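
    The segmenting idea can be sketched as follows: key-value pairs whose recomputation cost exceeds a threshold live in their own LRU segment, so they are not evicted by the stream of cheap items. This is a toy Python sketch with a fixed capacity split, not Memcached's C implementation, and the dynamic rebalancing between segments described above is omitted for brevity.

        from collections import OrderedDict

        class CostAwareCache:
            """Two LRU segments: one for cheap items, one for costly ones."""

            def __init__(self, cheap_capacity, costly_capacity, cost_threshold):
                self.cheap = OrderedDict()    # LRU order: oldest entry first
                self.costly = OrderedDict()
                self.cheap_capacity = cheap_capacity
                self.costly_capacity = costly_capacity
                self.cost_threshold = cost_threshold

            def get(self, key):
                for segment in (self.cheap, self.costly):
                    if key in segment:
                        segment.move_to_end(key)   # mark as most recently used
                        return segment[key]
                return None

            def put(self, key, value, recompute_cost):
                self.cheap.pop(key, None)          # avoid duplicates across segments
                self.costly.pop(key, None)
                segment, capacity = (
                    (self.costly, self.costly_capacity)
                    if recompute_cost >= self.cost_threshold
                    else (self.cheap, self.cheap_capacity)
                )
                segment[key] = value
                if len(segment) > capacity:
                    segment.popitem(last=False)    # evict the segment's LRU entry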

    Memory Management for Emerging Memory Technologies

    The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing, which demand more performance from data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory (NVM) technologies are emerging as an answer to the current main memory issues. This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of the CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM. The first solution we propose is “Adaptive Replacement and Insertion” (ARI), an adaptive approach to last-level CPU cache management, optimizing the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling” (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. To lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the system's DRAM with less expensive NVMe Flash. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to recover about 50% of the performance lost to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to a DRAM-only system.
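
    The trade-off ARI targets, fewer writebacks without hurting the miss rate, can be pictured with a simple writeback-aware eviction rule: among the oldest cache lines, prefer a clean victim. The sketch below shows that general idea only; it is not ARI's adaptive mechanism.

        from collections import OrderedDict

        class WritebackAwareCache:
            """LRU-like cache that prefers evicting clean lines to save writebacks."""

            def __init__(self, capacity, window=4):
                self.capacity = capacity
                self.window = window          # how many LRU candidates to inspect
                self.lines = OrderedDict()    # address -> dirty flag, oldest first
                self.writebacks = 0

            def _evict(self):
                candidates = list(self.lines)[: self.window]
                # prefer a clean candidate; otherwise fall back to the true LRU line
                victim = next((a for a in candidates if not self.lines[a]), candidates[0])
                if self.lines[victim]:
                    self.writebacks += 1      # dirty victim must be written back
                del self.lines[victim]

            def access(self, addr, is_write=False):
                if addr in self.lines:
                    self.lines[addr] = self.lines[addr] or is_write
                    self.lines.move_to_end(addr)
                    return True               # hit
                if len(self.lines) >= self.capacity:
                    self._evict()
                self.lines[addr] = is_write
                return False                  # miss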

    Gain More for Less: The Surprising Benefits of QoS Management in Constrained NDN Networks

    Quality of Service (QoS) in the IP world mainly manages forwarding resources, i.e., link capacities and buffer spaces. In addition, Information Centric Networking (ICN) offers resource dimensions such as in-network caches and forwarding state. In constrained wireless networks, these resources are scarce and their use has a potentially high impact due to lossy radio transmission. In this paper, we explore two basic service qualities, (i) prompt and (ii) reliable traffic forwarding, for the case of NDN. The resources we take into account are forwarding and queuing priorities, as well as the utilization of caches and of forwarding state space. We treat QoS resources not only in isolation, but also correlate their use on local nodes and between network members. Network-wide coordination is based on simple, predefined QoS code points. Our findings indicate that coordinated QoS management in ICN is more than the sum of its parts and exceeds the impact QoS can have in the IP world.
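
    One way to picture network-wide coordination via predefined QoS code points is a forwarder that maps each code point to a local queuing priority. The mapping and names below are hypothetical and serve only to illustrate the concept; they are not taken from the paper.

        import heapq
        import itertools

        # Hypothetical code points and the local forwarding priority they trigger.
        CODE_POINTS = {
            "prompt":   0,   # forwarded with the least delay
            "reliable": 1,
            "default":  2,
        }

        class PriorityForwarder:
            def __init__(self):
                self._queue = []
                self._order = itertools.count()   # FIFO tie-break within a priority

            def enqueue(self, interest_name, code_point="default"):
                priority = CODE_POINTS.get(code_point, CODE_POINTS["default"])
                heapq.heappush(self._queue, (priority, next(self._order), interest_name))

            def dequeue(self):
                _, _, interest_name = heapq.heappop(self._queue)
                return interest_name

        fwd = PriorityForwarder()
        fwd.enqueue("/sensor/temp", "reliable")
        fwd.enqueue("/alarm/fire", "prompt")
        print(fwd.dequeue())   # /alarm/fire is forwarded first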