A Cache Model for Modern Processors
Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models use stack distances to predict LRU or its variants, and cannot capture these high-performance policies. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. This model uses absolute reuse distances instead of stack distances, which makes it applicable to arbitrary age-based replacement policies. We thoroughly validate our model on several high-performance policies on synthetic and real benchmarks, where its median error is less than 1%. Finally, we present two case studies showing how to use the model to improve shared and single-stream cache performance
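The difference between the two distance measures named in the abstract above can be illustrated with a small sketch. It is not the authors' model; the function name `distances` and the counting convention (absolute reuse distance = number of accesses since the previous reference to the same address, stack distance = number of distinct other addresses touched in between) are assumptions chosen for illustration.

```python
def distances(trace):
    """For each access, return (absolute_reuse_distance, stack_distance).

    First-time references have no previous access, so both values are None.
    Convention (an assumption for this sketch):
      - absolute reuse distance: accesses elapsed since the previous
        reference to the same address;
      - stack distance: distinct other addresses referenced in between.
    """
    last_seen = {}  # address -> index of its most recent access
    result = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            j = last_seen[addr]
            reuse = i - j
            stack = len(set(trace[j + 1:i]))  # quadratic, fine for a sketch
            result.append((reuse, stack))
        else:
            result.append((None, None))
        last_seen[addr] = i
    return result


example = distances(["a", "b", "b", "c", "a"])
```

In this trace the final access to `a` has absolute reuse distance 4 but stack distance 2, because the repeated `b` counts once toward the stack distance and twice toward the reuse distance.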
Efficient Modeling of Random Sampling-Based LRU Cache
The Miss Ratio Curve (MRC) is an important metric and an effective tool for predicting and optimizing caching system performance. Since the Least Recently Used (LRU) replacement policy is the de facto policy for many existing caching systems, most previous studies on efficient MRC construction focus on LRU. Recently, random sampling-based replacement, as opposed to replacement relying on a rigid LRU data structure, has gained popularity because it is lightweight and flexible. To approximate LRU, at replacement time the system randomly selects K objects and evicts the least recently used object among the sample. Redis implements this approximated LRU policy. We observe that there can be a significant miss ratio gap between exact LRU and random sampling-based LRU, depending on the sampling size K; therefore, existing LRU MRC construction techniques cannot be directly applied to a random sampling-based LRU cache without loss of accuracy.
In this thesis, we present a new probabilistic stack algorithm named KRR, which can be used to accurately model a random sampling-based LRU cache with arbitrary sampling size K. We propose two efficient stack update algorithms that reduce the expected running time of KRR from O(NM) to O(N log^2 M) and O(N log M), respectively, where N is the workload length and M is the number of distinct objects. Our implementation generates accurate miss ratio curves for both fixed and variable block size caches. Furthermore, we adopt spatial sampling, which reduces the running time of KRR by several orders of magnitude and thus enables practical, low-overhead online application of KRR.
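The sampling-based eviction rule described in the abstract above (pick K cached objects at random, evict the least recently used among them) can be simulated directly. The sketch below is a plain simulator, not the KRR algorithm; the class name `SampledLRUCache`, the helper `miss_ratio`, and the choice to sample without replacement are assumptions made for illustration.

```python
import random


class SampledLRUCache:
    """Cache that evicts the least recently used of K randomly sampled keys."""

    def __init__(self, capacity, k, seed=0):
        self.capacity = capacity
        self.k = k
        self.rng = random.Random(seed)  # seeded for reproducible runs
        self.last_access = {}           # key -> logical time of last access
        self.clock = 0

    def access(self, key):
        """Reference `key`; return True on a hit, False on a miss."""
        self.clock += 1
        hit = key in self.last_access
        if not hit and len(self.last_access) >= self.capacity:
            # Sample up to K resident keys (without replacement, an assumption)
            # and evict the one with the oldest access time.
            keys = list(self.last_access)
            sample = self.rng.sample(keys, min(self.k, len(keys)))
            victim = min(sample, key=self.last_access.__getitem__)
            del self.last_access[victim]
        self.last_access[key] = self.clock
        return hit


def miss_ratio(trace, capacity, k, seed=0):
    """Fraction of accesses in `trace` that miss in a sampled-LRU cache."""
    cache = SampledLRUCache(capacity, k, seed)
    misses = sum(not cache.access(key) for key in trace)
    return misses / len(trace)
```

When K is at least the cache capacity, every resident key is sampled and the simulator behaves as exact LRU; smaller K moves the behaviour toward random replacement, which is the gap the abstract refers to.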
Optimal Eviction Policies for Stochastic Address Traces
The eviction problem for memory hierarchies is studied for the Hidden Markov
Reference Model (HMRM) of the memory trace, showing how miss minimization can
be naturally formulated in the optimal control setting. In addition to the
traditional version assuming a buffer of fixed capacity, a relaxed version is
also considered, in which buffer occupancy can vary and its average is
constrained. Resorting to multiobjective optimization, viewing occupancy as a
cost rather than as a constraint, the optimal eviction policy is obtained by
composing solutions for the individual addressable items.
This approach is then specialized to the Least Recently Used Stack Model
(LRUSM), a type of HMRM often considered for traces, which includes V-1
parameters, where V is the size of the virtual space. A gain optimal policy for
any target average occupancy is obtained which (i) is computable in time O(V)
from the model parameters, (ii) is optimal also for the fixed capacity case,
and (iii) is characterized in terms of priorities, with the name of Least
Profit Rate (LPR) policy. An O(log C) upper bound (C being the buffer capacity)
is derived for the ratio between the expected miss rate of LPR and that of OPT,
the optimal off-line policy; the upper bound is tightened to O(1), under
reasonable constraints on the LRUSM parameters. Using the stack-distance
framework, an algorithm is developed to compute the number of misses incurred
by LPR on a given input trace, simultaneously for all buffer capacities, in
time O(log V) per access.
Finally, some results are provided for miss minimization over a finite
horizon and over an infinite horizon under bias optimality, a criterion more
stringent than gain optimality.
Comment: 37 pages, 3 figure
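The stack-distance framework mentioned in the abstract above is easiest to see for plain LRU, where a single pass over the trace yields the miss count for every buffer capacity at once. The sketch below shows that classic idea for LRU only; it does not implement LPR, and the function name `lru_miss_counts` and the list-based stack (linear time per access, rather than the logarithmic bound stated in the abstract) are assumptions made for brevity.

```python
def lru_miss_counts(trace, max_capacity):
    """Return [misses at capacity 1, ..., misses at capacity max_capacity]
    for an LRU buffer, using one pass over the trace."""
    stack = []                      # index 0 holds the most recently used item
    hits_at_depth = [0] * (max_capacity + 1)
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr) + 1   # 1-based stack distance
            stack.remove(addr)
            if depth <= max_capacity:
                hits_at_depth[depth] += 1
        stack.insert(0, addr)
    # An access at depth d hits in every LRU buffer of capacity >= d.
    misses = []
    hits = 0
    for capacity in range(1, max_capacity + 1):
        hits += hits_at_depth[capacity]
        misses.append(len(trace) - hits)
    return misses


curve = lru_miss_counts(["a", "b", "a", "c", "b", "a"], 3)
```

The result relies on the inclusion property of LRU: the contents of a buffer of capacity C are always a subset of those of a buffer of capacity C + 1, so one stack suffices for all capacities.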
An accurate prefetching policy for object oriented systems
PhD Thesis
In the latest high-performance computers, there is a growing requirement for
accurate prefetching (AP) methodologies for advanced object management schemes
in virtual memory and migration systems. The major issue for achieving this goal is that
of finding a simple way of accurately predicting the objects that will be referenced in
the near future and grouping them so that they can be fetched at the same time. The
basic notion of AP involves building a relationship for logically grouping related
objects and prefetching them, rather than using their physical grouping and it relies on
demand fetching such as is done in existing restructuring or grouping schemes. By this,
AP tries to overcome some of the shortcomings posed by physical grouping methods.
Prefetching also makes use of the properties of object oriented languages to
build inter and intra object relationships as a means of logical grouping. This thesis
describes how this relationship can be established at compile time and how it can be
used for accurate object prefetching in virtual memory systems. In addition, AP
performs control flow and data dependency analysis to reinforce the relationships and
to find the dependencies of a program. The user program is decomposed into
prefetching blocks which contain all the information needed for block prefetching such
as long branches and function calls at major branch points.
The proposed prefetching scheme is implemented by extending a C++
compiler and evaluated on a virtual memory simulator. The results show a significant
reduction in both the number of page faults and memory pollution. In particular, AP
can suppress many page faults that occur during transition phases which are
unmanageable by other ways of fetching. AP can be applied to a local and distributed
virtual memory system so as to reduce the fault rate by fetching groups of objects at the
same time and consequently lessening operating system overheads.
British Counci
Study and optimization of the memory management in Memcached
Over the years the Internet has become more popular than ever and web applications
like Facebook and Twitter are gaining more users. As a result, users generate more and
more data, which has to be managed efficiently. Access speed is an
important factor nowadays: a user will wait no more than three seconds for a web
page to load before abandoning the site. In-memory key-value stores like Memcached
and Redis speed up web applications by reducing the number of accesses to
slower data storage. The first implementation
of Memcached, on the LiveJournal website, showed that by using 28 instances of Memcached
on ten unique hosts, caching the most popular 30GB of data can achieve a hit rate of
around 92%, reducing the number of accesses to the database and reducing the response
time considerably.
Not all objects in a cache take the same time to recompute, so this research
studies and presents a new cost-aware memory management scheme that is easy to integrate into a
key-value store, with the approach implemented in Memcached. The new memory
management scheme gives some priority to key-value pairs that take longer to be
recomputed. Instead of replacing Memcached’s replacement structure and its policy, we
simply add a new segment in each structure that is capable of storing the more costly
key-value pairs. Apart from this new segment in each replacement structure, we created
a new dynamic cost-aware rebalancing policy in Memcached, giving more memory to
store more costly key-value pairs.
With the implementation of our approaches, we were able to offer a prototype that
can be used to study the effect of cost on caching system performance. In addition, we
were able to improve, in certain scenarios, the access latency seen by the user and the total
recomputation cost of the key-value pairs stored in the system.
Memory Management for Emerging Memory Technologies
The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues.
This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM.
The first solution we propose is “Adaptive Replacement and Insertion” (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling.
Our second proposal is “Variable-Timeslice Thread Scheduling” (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta and allowing larger periods of time between interfering threads. Compared to conventional scheduling, our approach mitigates up to 100% of the performance loss caused by CPU LLC interference and boosts system throughput by up to 15%.
As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is used successfully in network routing but can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach.
In order to lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with less expensive NVMe Flash. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to regain about 50% of the performance lost to swapping into the NVM and thus enable the use of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping performance comparable to that of a DRAM-only system.
Gain More for Less: The Surprising Benefits of QoS Management in Constrained NDN Networks
Quality of Service (QoS) in the IP world mainly manages forwarding resources,
i.e., link capacities and buffer spaces. In addition, Information Centric
Networking (ICN) offers resource dimensions such as in-network caches and
forwarding state. In constrained wireless networks, these resources are scarce
with a potentially high impact due to lossy radio transmission. In this paper,
we explore the two basic service qualities (i) prompt and (ii) reliable traffic
forwarding for the case of NDN. The resources we take into account are
forwarding and queuing priorities, as well as the utilization of caches and of
forwarding state space. We treat QoS resources not only in isolation, but
correlate their use on local nodes and between network members. Network-wide
coordination is based on simple, predefined QoS code points. Our findings
indicate that coordinated QoS management in ICN is more than the sum of its
parts and exceeds the impact QoS can have in the IP world