11 research outputs found

    Divided disk cache and SSD FTL for improving performance in storage

    Although there are many efficient techniques for narrowing the speed gap between the processor and memory, that gap remains a bottleneck in many commercial implementations. Since secondary memory technologies are much slower than main memory, it is challenging to match memory speed to the processor. Hard disk drives usually include semiconductor caches to improve their performance. A hit in the disk cache eliminates the mechanical seek time and rotational latency. To further improve performance, a divided disk cache, subdivided between metadata and data, has been proposed previously. We propose a new algorithm that applies this approach to a flash memory-based solid state drive (SSD) by way of its flash translation layer (FTL). First, this paper evaluates the performance of such a disk cache via simulations using DiskSim. Then, we perform an experiment to evaluate the performance of the proposed algorithm.

    Centaur: Host-Side SSD Caching for Storage Performance Control


    Adaptive Resource Management Techniques for High Performance Multi-Core Architectures

    Reducing the average memory access time is crucial for improving the performance of applications executing on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Previous work has proposed techniques for partitioning shared resources (e.g. cache and bandwidth) and for prefetch throttling, with the goal of mitigating contention and reducing or hiding average memory access time. Cache partitioning in multi-core architectures is challenging due to the need to determine cache allocations with low computational overhead and the need to place the partitions in a locality-aware manner. Low computational overhead is required in order to scale to large core counts. Previous work within multi-resource management has proposed coordinately managing a subset of the techniques: cache partitioning, bandwidth partitioning and prefetch throttling. However, coordinated management of all three techniques opens up new possible trade-offs and interactions which can be leveraged to gain better performance. This thesis contributes two resource management techniques: a resource manager for scalable cache partitioning, and a multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching. The scalable resource management technique for cache partitioning uses a distributed and asynchronous cache partitioning algorithm that works together with a flexible NUCA enforcement mechanism in order to give locality-aware placement of data and support fine-grained partitions. The algorithm adapts quickly to application phase changes. The distributed nature of the algorithm, together with its low computational complexity, enables the solution to be implemented in hardware and to scale to large core counts.
The multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching is designed using the results of our in-depth characterisation of the entire SPEC CPU2006 suite. The solution consists of three local resource management techniques that, together with a coordination mechanism, provide allocations which take the inter-resource interactions and trade-offs into account. Our evaluation shows that the distributed cache partitioning solution performs within 1% of the best known centralized solution, which cannot scale to large core counts. The solution improves performance by 9% and 16%, on average, on 16-core and 64-core multi-core architectures, respectively, compared to a shared last-level cache. The multi-resource management technique gives a performance increase of 11%, on average, over the state of the art and improves performance by 50% compared to the baseline 16-core multi-core without cache partitioning, bandwidth partitioning and prefetch throttling.

    Optimal Eviction Policies for Stochastic Address Traces

    The eviction problem for memory hierarchies is studied for the Hidden Markov Reference Model (HMRM) of the memory trace, showing how miss minimization can be naturally formulated in the optimal control setting. In addition to the traditional version assuming a buffer of fixed capacity, a relaxed version is also considered, in which buffer occupancy can vary and its average is constrained. Resorting to multiobjective optimization, viewing occupancy as a cost rather than as a constraint, the optimal eviction policy is obtained by composing solutions for the individual addressable items. This approach is then specialized to the Least Recently Used Stack Model (LRUSM), a type of HMRM often considered for traces, which includes V-1 parameters, where V is the size of the virtual space. A gain optimal policy for any target average occupancy is obtained which (i) is computable in time O(V) from the model parameters, (ii) is optimal also for the fixed capacity case, and (iii) is characterized in terms of priorities, with the name of Least Profit Rate (LPR) policy. An O(log C) upper bound (C being the buffer capacity) is derived for the ratio between the expected miss rate of LPR and that of OPT, the optimal off-line policy; the upper bound is tightened to O(1) under reasonable constraints on the LRUSM parameters. Using the stack-distance framework, an algorithm is developed to compute the number of misses incurred by LPR on a given input trace, simultaneously for all buffer capacities, in time O(log V) per access. Finally, some results are provided for miss minimization over a finite horizon and over an infinite horizon under bias optimality, a criterion more stringent than gain optimality. Comment: 37 pages, 3 figures
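
The stack-distance idea mentioned in this abstract can be illustrated with the classic one-pass computation for plain LRU. The sketch below is not the paper's LPR algorithm and does not reach its O(log V) per-access bound; it is a simple list-based version, and the function name `lru_miss_counts` is a hypothetical name chosen for illustration.

```python
from collections import Counter


def lru_miss_counts(trace, max_capacity):
    """Return a list m where m[c] is the number of LRU misses on `trace`
    with a buffer of capacity c, for every c in 0..max_capacity.

    Illustrative sketch for plain LRU (not the LPR policy of the paper).
    """
    stack = []                 # LRU stack, most recently used item at index 0
    distance_hist = Counter()  # stack distance -> number of accesses
    cold_misses = 0            # first-time accesses miss at every capacity

    for item in trace:
        if item in stack:
            # Stack distance = 1-based depth of the item in the LRU stack.
            depth = stack.index(item) + 1
            distance_hist[depth] += 1
            stack.remove(item)
        else:
            cold_misses += 1
        stack.insert(0, item)

    misses = []
    for capacity in range(max_capacity + 1):
        # An access hits in a buffer of size c iff its stack distance <= c.
        too_deep = sum(n for d, n in distance_hist.items() if d > capacity)
        misses.append(cold_misses + too_deep)
    return misses


if __name__ == "__main__":
    print(lru_miss_counts(["a", "b", "c", "a", "b", "c"], 3))
```

A single pass over the trace yields the miss count for every capacity at once, which is the property the abstract refers to when it says "simultaneously for all buffer capacities".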

    Improving Caches in Consolidated Environments

    Memory (cache, DRAM, and disk) is in charge of providing data and instructions to a computer’s processor. In order to maximize performance, the speeds of the memory and the processor should be equal. However, using memory that always matches the speed of the processor is prohibitively expensive. Computer hardware designers have managed to drastically lower the cost of the system with the use of memory caches by sacrificing some performance. A cache is a small piece of fast memory that stores popular data so it can be accessed faster. Modern computers have evolved into a hierarchy of caches, where a memory level is the cache for a larger and slower memory level immediately below it. Thus, by using caches, manufacturers are able to store terabytes of data at the cost of the cheapest memory while achieving speeds close to the speed of the fastest one. The most important decision about managing a cache is what data to store in it. Failing to make good decisions can lead to performance overheads and over-provisioning. Surprisingly, caches choose data to store based on policies that have not changed in principle for decades. However, computing paradigms have changed radically, leading to two noticeably different trends. First, caches are now consolidated across hundreds to even thousands of processes. Second, caching is being employed at new levels of the storage hierarchy due to the availability of high-performance flash-based persistent media. This brings four problems. First, as the number of workloads sharing a cache increases, it is more likely that they contain duplicated data. Second, consolidation creates contention for caches, and if not managed carefully, it translates to wasted space and sub-optimal performance. Third, as contended caches are shared by more workloads, administrators need to carefully estimate specific per-workload requirements across the entire memory hierarchy in order to meet per-workload performance goals.
Finally, current cache write policies are unable to simultaneously provide performance and consistency guarantees for the new levels of the storage hierarchy. We addressed these problems by modeling their impact and by proposing solutions for each of them. First, we measured and modeled the amount of duplication at the buffer cache level and contention in real production systems. Second, we created a unified model of workload cache usage under contention to be used by administrators for provisioning, or by process schedulers to decide what processes to run together. Third, we proposed methods for removing cache duplication and for eliminating space wasted because of contention. Finally, we proposed a technique to improve the consistency guarantees of write-back caches while preserving their performance benefits.

    Paging on Complex Architectures

    Advances in technology make it possible to build computer systems of ever increasing performance and capability. However, the effective use of such computational resources is often made difficult by the complexity of the system itself. Crucial to the performance of a computing device is the orchestration of the flow of data across the memory hierarchy. Specifically, given a fast but small memory (a cache) through which all the data to be processed must pass, it is necessary to establish a set of rules, implemented by an algorithm, that define which data has to be evicted from that memory to make room for new incoming data. The goal is to minimize the number of times that requested data is outside the cache (faults), since fetching data from farther levels of the memory hierarchy incurs high costs in terms of both time and energy. This thesis studies two generalizations of this problem, known as the paging problem. The problem is intrinsically online, as future data requests issued by a computer program are typically unknown. Motivated by the recent spread of multi-threaded and multi-core architectures, in which several threads or processes can be executed simultaneously and/or there are several processing units, and by the rapidly growing interest in reducing the power consumption of computer systems, in the first part of the thesis we study a variation of paging which rewards the efficient usage of memory resources. In this problem the goal is to minimize a combination of the number of faults and the cache occupancy of the process's data in fast memory. This part has two main results: the first is an impossibility result indicating that, roughly speaking, online algorithms cannot compete in practice with algorithms that know in advance all the data requests issued by the process; the second is the design of an online algorithm whose performance is almost the best possible among all online algorithms.
In the second part of the thesis we concentrate on the management of a cache shared among several concurrent processes. As outlined above, this has direct application in multi-threaded or multi-core architectures. In this problem the fast memory has to service a sequence of requests which is the interleaving of the requests issued by t different processes. Through its replacement decisions, the algorithm dynamically allocates the cache space among the processes, and this clearly impacts their progress. The main goal here is to minimize the time needed to complete the service of all the request sequences. We show tight lower and upper bounds on the performance of online algorithms for several variants of the problem.
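
As a rough illustration of the gap between online and offline paging described in this abstract, the sketch below counts faults for the classic paging problem only (not the occupancy-aware or shared-cache variants the thesis studies). It compares LRU, an online policy, with the well-known offline rule of evicting the page whose next use is farthest in the future. The function names `lru_faults` and `opt_faults` are hypothetical, and both assume `trace` is a list and `capacity` is at least 1.

```python
def lru_faults(trace, capacity):
    """Count faults of the online LRU policy (classic paging, capacity >= 1)."""
    cache = []  # least recently used page first, most recently used last
    faults = 0
    for page in trace:
        if page in cache:
            cache.remove(page)      # hit: page will be re-appended as most recent
        else:
            faults += 1
            if len(cache) == capacity:
                cache.pop(0)        # evict the least recently used page
        cache.append(page)
    return faults


def opt_faults(trace, capacity):
    """Count faults of the offline farthest-in-future rule (capacity >= 1)."""
    cache = set()
    faults = 0
    for i, page in enumerate(trace):
        if page in cache:
            continue
        faults += 1
        if len(cache) == capacity:
            def next_use(p):
                # Position of the next request for p, or infinity if none.
                try:
                    return trace.index(p, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return faults


if __name__ == "__main__":
    trace = [1, 2, 3, 1, 2, 3]
    print(lru_faults(trace, 2), opt_faults(trace, 2))
```

On a cyclic trace over three pages with a cache of two, LRU faults on every request while the offline rule does not, which is the kind of separation that competitive analysis of online paging quantifies.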

    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources such as the cache are usually shared to maximize utilization, but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches. However, microarchitectural optimizations, which were long assumed to be fundamentally secure, can be used in side-channel attacks to extract secrets such as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes three adaptive microarchitectural resource management optimizations to improve security and/or performance of multi-core architectures, and a systematization-of-knowledge of timing-based side-channel attacks. We observe that to achieve high-performance cache partitioning in a multi-core system three requirements need to be met: i) fine granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to high overheads for current centralized partitioning solutions, especially as the number of cores in the system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity.
The solution is implementable in hardware, due to low computational complexity, and can scale to large core counts. According to our analysis, better performance can be achieved by coordinating multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but this is challenging due to the increased number of possible allocations which need to be evaluated. Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching. Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space. The continuously growing number of side-channel attacks leveraging microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation and information flow. Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection. Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities. Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance. To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness, and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications.

    Design of disk cache for high performance computing.

    by Vincent, Kwan Chi Wai. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 123-127).
    Contents:
    Abstract
    Acknowledgement
    List of Tables
    List of Figures
    Chapter 1. Introduction
        1.1 I/O System
        1.2 Disk Cache
        1.3 Dissertation Outline
    Chapter 2. Related Work
        2.1 Prefetching
        2.2 Cache Partitioning
            2.2.1 Hardware Assisted Mechanism
            2.2.2 Software Assisted Mechanism
        2.3 Replacement Policy
        2.4 Caching Write Operation
        2.5 Others
        2.6 Summary
    Chapter 3. Methodology and Models
        3.1 Performance Measurement
            3.1.1 Partial Hit
            3.1.2 Time Model
        3.2 Terminology
            3.2.1 Transfer Block
            3.2.2 Multiple-sector Request
            3.2.3 Dynamic Block, Heading Sectors and Content Sectors
            3.2.4 Heading Reuse and Non-heading Reuse
        3.3 New Models
            3.3.1 Unified Cache with Always Prefetch
            3.3.2 Partitioned Cache: Branch Target Cache and Prefetch Buffer
            3.3.3 BTC + PB with Alternative Storing Sector Technique
            3.3.4 BTC + PB with ASST Applying to Dynamic Block
            3.3.5 BTC + PB with Storing Enough Head Technique
        3.4 Impact of Block Size
    Chapter 4. Trace Driven Simulation
        4.1 Simulation Environment
        4.2 Two Kinds Of Disk
        4.3 Control Models
            4.3.1 Model 1: No Cache
            4.3.2 Model 2: Unified Cache without Prefetch
            4.3.3 Model 3: Unified Cache with Prefetch on Miss
        4.4 Two Comparison Standards
        4.5 Trace Properties
    Chapter 5. Performance Evaluation of Common Disk
        5.1 The Effect Of Cache Size
            5.1.1 Trends of Absolute Reduction in Time
            5.1.2 Trends of Relative Reduction in Time
        5.2 The Effect Of Block Size
            5.2.1 Trends of Absolute Reduction in Time
            5.2.2 Trends of Relative Reduction in Time
        5.3 The Effect Of Set Associativity
            5.3.1 Trends of Absolute Reduction in Time
        5.4 The Effect Of Start-up Time C1
            5.4.1 Trends of Absolute Reduction in Time
            5.4.2 Trends of Relative Reduction in Time
        5.5 The Effect Of Transfer Time C2
            5.5.1 Trends of Absolute Reduction in Time
            5.5.2 Trends of Relative Reduction in Time
            5.5.3 Impact of C2=0.5 on Cache Size
            5.5.4 Impact of C2=0.5 on Block Size
        5.6 The Effect Of Prefetch Buffer Size
        5.7 Others
            5.7.1 In The Case of Very Small Cache with Large Block Size
            5.7.2 Comparing Performance of Model 6 and Model 7
        5.8 Conclusion
            5.8.1 The Number of Actual Sectors Transferred between Disk and Cache
            5.8.2 The Efficiency of Our Models on Common Disk
    Chapter 6. Performance Evaluation of High Performance Disk
        6.1 Difference Between Common Disk And High Performance Disk
        6.2 The Effect Of Cache Size
            6.2.1 Trends of Absolute Reduction in Time
            6.2.2 Trends of Relative Reduction in Time
        6.3 The Effect Of Block Size
            6.3.1 Trends of Absolute Reduction in Time
            6.3.2 Trends of Relative Reduction in Time
        6.4 The Effect Of Start-up Time C1
            6.4.1 Trends of Relative Reduction in Time
        6.5 The Effect Of Transfer Time C2
            6.5.1 Trends of Relative Reduction in Time
            6.5.2 Impact of C2=0.5 on Cache Size
            6.5.3 Impact of C2=0.5 on Block Size
        6.6 Conclusion
    Chapter 7. Conclusions and Future Work
        7.1 Conclusions
        7.2 Future Work
    Bibliography

    Software-assisted cache mechanisms for embedded systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 120-135). Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms to improve cache performance and provide opportunities to improve data availability that can lead to more predictable cache performance. The first cache mechanism presented is an intelligent cache replacement policy that utilizes information about dead data and data that is very frequently used. This mechanism is analyzed theoretically to show that the number of misses using intelligent cache replacement is guaranteed to be no more than the number of misses using traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated. The second cache mechanism presented is that of cache partitioning which exploits disjoint access sequences that do not overlap in the memory space. A theoretical result is proven that shows that modifying an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. A profit-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. This framework takes a compiled benchmark program and a set of program inputs and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed cache mechanisms have been evaluated using this framework by measuring cache miss rates and Instructions Per Clock (IPC) information.
The results show that the proposed cache mechanisms show promise in improving cache performance and predictability with a modest increase in silicon area. by Prabhat Jain. Ph.D.

    Monitoring, analysis and optimisation of I/O in parallel applications

    High performance computing (HPC) is changing the way science is performed in the 21st Century; experiments that once took enormous amounts of time, were dangerous and often produced inaccurate results can now be performed and refined in a fraction of the time in a simulation environment. Current generation supercomputers are running in excess of 10^16 floating point operations per second, and the push towards exascale will see this increase by two orders of magnitude. To achieve this level of performance it is thought that applications may have to scale to potentially billions of simultaneous threads, pushing hardware to its limits and severely impacting failure rates. To reduce the cost of these failures, many applications use checkpointing to periodically save their state to persistent storage, such that, in the event of a failure, computation can be restarted without significant data loss. As computational power has grown by approximately 2x every 18 to 24 months, persistent storage has lagged behind; checkpointing is fast becoming a bottleneck to performance. Several software and hardware solutions have been presented to solve the current I/O problem being experienced in the HPC community and this thesis examines some of these. Specifically, this thesis presents a tool designed for analysing and optimising the I/O behaviour of scientific applications, as well as a tool designed to allow the rapid analysis of one software solution to the problem of parallel I/O, namely the parallel log-structured file system (PLFS). This thesis ends with an analysis of a modern Lustre file system under contention from multiple applications and multiple compute nodes running the same problem through PLFS. The results and analysis presented outline a framework through which application settings and procurement decisions can be made.