23 research outputs found

    Memory Subsystem Optimization for Efficient System Resource Utilization of Data-Intensive Applications

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: 염헌영 (Heon Y. Yeom).
    With explosive data growth, data-intensive applications such as relational databases and key-value stores have become increasingly popular across a variety of domains in recent years. To meet the growing performance demands of data-intensive applications, it is crucial to utilize memory resources efficiently and fully for the best possible performance. However, general-purpose operating systems (OSs) are designed to provide system resources to the applications running on a system fairly, at the system level. Because of this system-level fairness, a single application may find it difficult to fully exploit the system's best performance. For performance reasons, many data-intensive applications therefore implement their own versions of mechanisms that the OS already provides, under the assumption that they know their data better than the OS does. These application-level mechanisms can be greedily optimized for performance, but this may result in inefficient use of system resources. In this dissertation, we claim that simple OS support combined with minor application modifications can yield even higher application performance without sacrificing system-level resource utilization. We optimize and extend the OS memory subsystem to better support applications while addressing three memory-related issues in data-intensive applications. First, we introduce a memory-efficient cooperative caching approach between the application and the kernel buffer to address the double-caching problem, in which the same data resides in multiple layers. Second, we present a memory-efficient, transparent zero-copy read I/O scheme that avoids the performance interference caused by memory copies during I/O. Third, we propose a memory-efficient fork-based checkpointing mechanism for in-memory database systems that mitigates the memory-footprint problem of the existing fork-based checkpointing scheme, whose memory usage grows incrementally (up to 2x) during checkpointing for update-intensive workloads. To show the effectiveness of our approach, we implement and evaluate our schemes on real multi-core systems. The experimental results demonstrate that our cooperative approach addresses these issues more effectively than existing non-cooperative approaches while delivering better performance in terms of transaction processing speed, I/O throughput, and memory footprint.
    Contents:
    Chapter 1 Introduction
        1.1 Motivation
            1.1.1 Importance of Memory Resources
            1.1.2 Problems
        1.2 Contributions
        1.3 Outline
    Chapter 2 Background
        2.1 Linux Kernel Memory Management
            2.1.1 Page Cache
            2.1.2 Page Reclamation
            2.1.3 Page Table and TLB Shootdown
            2.1.4 Copy-on-Write
        2.2 Linux Support for Applications
            2.2.1 fork
            2.2.2 madvise
            2.2.3 Direct I/O
            2.2.4 mmap
    Chapter 3 Memory Efficient Cooperative Caching
        3.1 Motivation
            3.1.1 Problems of Existing Datastore Architecture
            3.1.2 Proposed Architecture
        3.2 Related Work
        3.3 Design and Implementation
            3.3.1 Overview
            3.3.2 Kernel Support
            3.3.3 Migration to DBIO
        3.4 Evaluation
            3.4.1 System Configuration
            3.4.2 Methodology
            3.4.3 TPC-C Benchmarks
            3.4.4 YCSB Benchmarks
        3.5 Summary
    Chapter 4 Memory Efficient Zero-copy I/O
        4.1 Motivation
            4.1.1 The Problems of Copy-Based I/O
        4.2 Related Work
            4.2.1 Zero Copy I/O
            4.2.2 TLB Shootdown
            4.2.3 Copy-on-Write
        4.3 Design and Implementation
            4.3.1 Prerequisites for z-READ
            4.3.2 Overview of z-READ
            4.3.3 TLB Shootdown Optimization
            4.3.4 Copy-on-Write Optimization
            4.3.5 Implementation
        4.4 Evaluation
            4.4.1 System Configurations
            4.4.2 Effectiveness of the TLB Shootdown Optimization
            4.4.3 Effectiveness of CoW Optimization
            4.4.4 Analysis of the Performance Improvement
            4.4.5 Performance Interference Intensity
            4.4.6 Effectiveness of z-READ in Macrobenchmarks
        4.5 Summary
    Chapter 5 Memory Efficient Fork-based Checkpointing
        5.1 Motivation
            5.1.1 Fork-based Checkpointing
            5.1.2 Approach
        5.2 Related Work
        5.3 Design and Implementation
            5.3.1 Overview
            5.3.2 OS Support
            5.3.3 Implementation
        5.4 Evaluation
            5.4.1 Experimental Setup
            5.4.2 Performance
        5.5 Summary
    Chapter 6 Conclusion
    요약 (Abstract in Korean)
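    The fork-based checkpointing that Chapter 5 improves on is, at its core, the familiar copy-on-write snapshot pattern: the parent forks, the child persists its frozen view of the heap, and the parent keeps serving updates, paying extra memory only for pages it dirties while the checkpoint is in flight. Below is a minimal sketch of that baseline, not the dissertation's optimized scheme; the kv_store layout and file handling are illustrative.

```c
/* Baseline fork-based checkpoint: the child sees a copy-on-write snapshot
 * of the parent's heap at fork() time and writes it out, while the parent
 * keeps processing updates.  Pages the parent dirties in the meantime are
 * duplicated by the kernel, which is the memory-footprint growth the
 * dissertation's scheme targets. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct kv_store {            /* illustrative in-memory table */
    size_t nrecords;
    char (*records)[64];
};

static int write_snapshot(const struct kv_store *db, const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fwrite(&db->nrecords, sizeof db->nrecords, 1, f);
    fwrite(db->records, sizeof db->records[0], db->nrecords, f);
    return fclose(f);
}

int checkpoint(const struct kv_store *db, const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                          /* child: frozen CoW snapshot */
        _exit(write_snapshot(db, path) == 0 ? 0 : 1);

    /* Parent: would continue serving updates and reap the child
     * asynchronously; shown synchronously here for brevity. */
    int status;
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```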

    Low-Overhead Migration of Read-Only and Read-Mostly Data for Adapting Applications to Hybrid Memory Systems

    Memory systems containing different types of memory with varying capacity, latency, and bandwidth are rapidly becoming mainstream. Conventional memory management techniques do not suffice for these systems; they require alternative strategies to appropriately and effectively adapt application memory placement to these heterogeneous memory tiers. Software-based placement and movement strategies are the most desirable due to their flexibility and ease of adoption by end-users. However, there are substantial sources of overhead when synchronizing low-level data movement with the operating system and running applications. This thesis proposes a novel method of reducing these memory-movement overheads on hybrid memory systems. Many data objects are only written to early in their life cycle (i.e., shortly after allocation) and are effectively read-only after these initial writes. If this read-only and read-mostly data is duplicated across memory tiers, as opposed to moved, the application can in many cases avoid certain types of transfer overhead, such as page table entry (PTE) and MMU cache (TLB) synchronization stalls. This work describes the design and implementation of a kernel module, mtier, that implements this optimization on memory that has been explicitly marked as read-only. Our evaluation demonstrates that this approach has the potential to substantially reduce data-movement overheads, especially in applications that are multi-threaded and require frequent movement of data, allowing a flexible, software-based approach for memory management in hybrid systems.
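    The abstract does not spell out mtier's user-facing interface, so the sketch below only illustrates the application-side precondition the approach relies on: data that is written once right after allocation and then explicitly frozen as read-only. The /dev/mtier device and MTIER_REGISTER_RO ioctl are hypothetical stand-ins for however the real module is actually notified.

```c
/* Illustrative only: freeze a write-once buffer as read-only so a tiering
 * layer (such as the mtier module described above) could safely replicate
 * it across memory tiers instead of migrating it.  The device path and
 * ioctl number below are hypothetical. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MTIER_REGISTER_RO 0x4d01        /* hypothetical ioctl number */

void *alloc_read_mostly(size_t len, const void *init, size_t init_len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, init, init_len);        /* all writes happen here, once */

    if (mprotect(buf, len, PROT_READ))  /* explicitly mark read-only */
        goto fail;

    int fd = open("/dev/mtier", O_RDWR);   /* hypothetical device */
    if (fd >= 0) {
        struct { void *addr; size_t len; } req = { buf, len };
        ioctl(fd, MTIER_REGISTER_RO, &req); /* best-effort replication hint */
        close(fd);
    }
    return buf;
fail:
    munmap(buf, len);
    return NULL;
}
```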

    Architectural Support for Optimizing Huge Page Selection Within the OS

    Accepted version of a paper that appeared in the 56th ACM/IEEE International Symposium on Microarchitecture (MICRO 2023), Toronto, Canada; © 2023, held by the owner/author(s) under the CC-BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Final published version: https://doi.org/10.1145/3613424.3614296
    Irregular, memory-intensive applications often incur high translation lookaside buffer (TLB) miss rates that result in significant address translation overheads. Employing huge pages is an effective way to reduce these overheads; however, in real systems the number of available huge pages can be limited when system memory is nearly full and/or fragmented. Thus, huge pages must be used selectively to back application memory. This work demonstrates that choosing the memory regions that incur the most TLB misses for huge page promotion best reduces address translation overheads. We call these regions High reUse TLB-sensitive data (HUBs). Unlike prior work, which relies on expensive per-page software counters to identify promotion regions, we propose new architectural support to identify these regions dynamically at application runtime. We propose a promotion candidate cache (PCC) that identifies HUB candidates based on hardware page table walks after a last-level TLB miss. This small, fixed-size structure tracks huge page-aligned regions (consisting of base pages), ranks them based on observed page table walk frequency, and only keeps the most frequently accessed ones. Evaluated on applications of various memory intensity, our approach successfully identifies the application pages incurring the highest address translation overheads. Our approach demonstrates that with the help of a PCC, the OS only needs to promote 4% of the application footprint to achieve more than 75% of the peak achievable performance, yielding 1.19-1.33× speedups over 4KB base pages alone. In real systems, where memory is typically fragmented, the PCC outperforms Linux's page promotion policy by 14% (when 50% of total memory is fragmented) and 16% (when 90% of total memory is fragmented), respectively.
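    A small software model may make the PCC's bookkeeping concrete: a fixed-size table of huge-page-aligned regions whose counters are bumped on every page-table walk, evicting the least-frequently-walked entry when full. The sizes, the eviction rule, and the C rendering are illustrative; the paper's PCC is a hardware structure filled by the page-table walker after last-level TLB misses.

```c
/* Toy model of a promotion candidate cache (PCC): track 2 MiB-aligned
 * regions by observed page-table-walk frequency in a small fixed table. */
#include <stdint.h>

#define PCC_ENTRIES 64
#define HUGE_SHIFT  21                    /* 2 MiB regions */

struct pcc_entry { uint64_t region; uint64_t walks; };
static struct pcc_entry pcc[PCC_ENTRIES];

/* Record one page-table walk for virtual address va. */
void pcc_record_walk(uint64_t va)
{
    uint64_t region = va >> HUGE_SHIFT;
    int victim = 0;

    for (int i = 0; i < PCC_ENTRIES; i++) {
        if (pcc[i].walks && pcc[i].region == region) {
            pcc[i].walks++;               /* hit: bump frequency */
            return;
        }
        if (pcc[i].walks < pcc[victim].walks)
            victim = i;                   /* remember least-walked slot */
    }
    /* Miss: replace the least-frequently-walked (or empty) entry. */
    pcc[victim].region = region;
    pcc[victim].walks  = 1;
}

/* The OS would periodically read out the hottest regions (the HUBs)
 * and promote them to huge pages. */
uint64_t pcc_hottest_region(void)
{
    int best = 0;
    for (int i = 1; i < PCC_ENTRIES; i++)
        if (pcc[i].walks > pcc[best].walks)
            best = i;
    return pcc[best].region << HUGE_SHIFT;
}
```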

    A Survey of Techniques for Architecting TLBs

    A "translation lookaside buffer" (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.

    Doctor of Philosophy

    In recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. The International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence the available off-chip bandwidth, will at best increase at a sublinear rate and at worst stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM), have begun to emerge and will be integrated into the memory hierarchy sooner rather than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform access cache (NUCA), we must figure out the optimal bank. Similarly, in a main memory built from multiple dual inline memory modules (DIMMs) with non-uniform memory access (NUMA), we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data. In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" In this dissertation, we argue for a hardware-software codesign approach to tackle the above-mentioned problems at different levels of the memory hierarchy. The proposed methods utilize techniques like page coloring and shadow addresses and are able to handle a large number of problems, ranging from managing wire delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures.
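    The page-coloring idea this dissertation builds on can be stated in a few lines of arithmetic: the low bits of a page's physical frame number determine which cache slice (or bank/DIMM) it falls into, so an allocator that keeps per-color free lists can steer data by choosing frames of the right color. A hedged sketch, with illustrative cache parameters:

```c
/* Page-coloring arithmetic used by OS-level placement schemes like the one
 * above: the low bits of the physical page frame number select which slice
 * of the cache a page lands in, so the allocator can pick a free frame
 * whose color matches the desired destination.  Parameters are examples. */
#include <stdint.h>

#define PAGE_SHIFT   12                   /* 4 KiB pages */
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define CACHE_BYTES  (8u << 20)           /* 8 MiB shared LLC (example) */
#define CACHE_WAYS   16

/* Number of distinct colors = pages covered by one cache way. */
static inline unsigned num_colors(void)
{
    return CACHE_BYTES / CACHE_WAYS / PAGE_SIZE;   /* 128 colors here */
}

/* Color of a physical address: which LLC slice its page maps to. */
static inline unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr >> PAGE_SHIFT) % num_colors());
}

/* An allocator that wants to place a page in slice/bank `target` keeps
 * per-color free lists and pops a frame from free_list[target]. */
```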

    Snapshot: Fast, Userspace Crash Consistency for CXL and PM Using msync

    Crash consistency using persistent memory programming libraries requires programmers to use complex transactions and manual annotations. In contrast, the failure-atomic msync() (FAMS) interface is much simpler, as it transparently tracks updates and guarantees that modified data is atomically durable on a call to the failure-atomic variant of msync(). However, FAMS suffers from several drawbacks, like the overhead of msync() and the write amplification from page-level dirty data tracking. To address these drawbacks while preserving the advantages of FAMS, we propose Snapshot, an efficient userspace implementation of FAMS. Snapshot uses compiler-based annotation to transparently track updates in userspace and syncs them with the backing byte-addressable storage copy on a call to msync(). By keeping a copy of application data in DRAM, Snapshot improves access latency. Moreover, with automatic tracking and syncing of changes only on a call to msync(), Snapshot provides crash-consistency guarantees, unlike the POSIX msync() system call. For a KV-Store backed by Intel Optane running the YCSB benchmark, Snapshot achieves at least a 1.2× speedup over PMDK while significantly outperforming conventional (non-crash-consistent) msync(). On an emulated CXL memory semantic SSD, Snapshot outperforms PMDK by up to 10.9× on all but one YCSB workload, where PMDK is 1.2× faster than Snapshot. Further, Kyoto Cabinet commits perform up to 8.0× faster with Snapshot than with its built-in, msync()-based crash-consistency mechanism. Comment: A shorter version of this paper appeared in the Proceedings of ICCD 202
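    From the application's point of view, a FAMS-style interface such as Snapshot's collapses crash consistency into a single call: mutate the mapped data freely, then make every update since the last sync durable atomically. The sketch below uses hypothetical fams_map()/fams_msync() names; the fallback bodies are plain mmap()/msync(), which do not provide the atomicity guarantee, and that gap is precisely what Snapshot fills.

```c
/* Consuming a failure-atomic msync (FAMS) interface: both stores in
 * transfer() must become durable together.  fams_map()/fams_msync() are
 * hypothetical wrappers; the placeholder bodies here are NOT failure
 * atomic and only exist so the sketch compiles. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

void *fams_map(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

int fams_msync(void *addr, size_t len)
{
    return msync(addr, len, MS_SYNC);   /* real FAMS: atomic commit point */
}

struct account { uint64_t balance; };

int transfer(struct account *accts, size_t n, size_t from, size_t to,
             uint64_t amount)
{
    if (from >= n || to >= n || accts[from].balance < amount)
        return -1;
    /* Paired updates: a crash between them must not persist only one. */
    accts[from].balance -= amount;
    accts[to].balance   += amount;
    /* With FAMS, both stores become durable atomically here. */
    return fams_msync(accts, n * sizeof accts[0]);
}
```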

    Paving the Path for Heterogeneous Memory Adoption in Production Systems

    Systems from smartphones to data-centers to supercomputers are increasingly heterogeneous, comprising various memory technologies and core types. Heterogeneous memory systems provide an opportunity to suitably match varying memory access patterns in applications, reducing CPU time and thus increasing performance per dollar, resulting in aggregate savings of millions of dollars in large-scale systems. However, with increased provisioning of main memory capacity per machine and differences in memory characteristics (for example, bandwidth, latency, cost, and density), memory management in such heterogeneous memory systems poses multi-fold challenges for system programmability and design. In this thesis, we tackle memory management of two heterogeneous memory systems: (a) CPU-GPU systems with a unified virtual address space, and (b) cloud computing platforms that can deploy cheaper but slower memory technologies alongside DRAM to reduce the cost of memory in data-centers. First, we show that operating systems do not have sufficient information to optimally manage pages in bandwidth-asymmetric systems and thus fail to maximize bandwidth to massively-threaded GPU applications, sacrificing GPU throughput. We present BW-AWARE placement/migration policies to help the OS make optimal data management decisions. Second, we present a CPU-GPU cache coherence design where the CPU and GPU need not implement the same cache coherence protocol but still provide a cache-coherent memory interface to the programmer. Our proposal is the first practical approach to provide a unified, coherent CPU-GPU address space without requiring hardware cache coherence, with the potential to enable an explosion in algorithms that leverage tightly coupled CPU-GPU coordination. Finally, to reduce the cost of memory in cloud platforms, where the trend has been to map datasets in memory, we make a case for a two-tiered memory system in which cheaper (per bit) memories, such as Intel/Micron's 3D XPoint, are deployed alongside DRAM. We present Thermostat, an application-transparent, huge-page-aware software mechanism to place pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and the performance advantages of transparent huge pages. With Thermostat's capability to control the application slowdown on a per-application basis, cloud providers can realize cost savings from upcoming cheaper memory technologies by shifting infrequently accessed cold data to slow memory while satisfying the throughput demand of their customers.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/137052/1/nehaag_1.pd
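    When the slow tier is exposed to the OS as a CPU-less NUMA node, the migration primitive a Thermostat-style policy ultimately relies on is page migration, for example Linux's move_pages(). The sketch below assumes node 1 backs the cheap/slow memory and that some separate sampling mechanism (as in Thermostat) has already decided which pages are cold; it is not the paper's implementation.

```c
/* Demote already-identified cold pages to a slow memory tier exposed as
 * NUMA node 1.  Node numbering and the caller that picks cold pages are
 * assumptions.  Link with -lnuma. */
#define _GNU_SOURCE
#include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */
#include <stddef.h>

#define SLOW_NODE 1          /* assumed NUMA node backing the slow tier */

/* Move `count` base pages, given by their starting addresses, to the slow
 * tier.  Returns the number of pages that ended up on SLOW_NODE, or -1. */
long demote_cold_pages(void **pages, unsigned long count)
{
    int nodes[count];
    int status[count];
    for (unsigned long i = 0; i < count; i++)
        nodes[i] = SLOW_NODE;

    if (move_pages(0 /* self */, count, pages, nodes, status,
                   MPOL_MF_MOVE) < 0)
        return -1;

    long moved = 0;
    for (unsigned long i = 0; i < count; i++)
        if (status[i] == SLOW_NODE)
            moved++;
    return moved;
}
```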

    Software-Oriented Data Access Characterization for Chip Multiprocessor Architecture Optimizations

    The integration of an increasing amount of on-chip hardware in chip multiprocessors (CMPs) poses the challenge of efficiently utilizing on-chip resources to maximize performance. Prior research proposals largely rely on additional hardware support to achieve desirable tradeoffs. However, these purely hardware-oriented mechanisms typically result in more generic but less efficient approaches. A newer trend is designing adaptive systems by exploiting and leveraging application-level information. In this work, a wide range of applications is analyzed, and notable data access behaviors/patterns are identified as useful for architectural and system optimizations. In particular, this dissertation introduces software-based techniques that can be used to extract data access characteristics for cross-layer optimizations of performance and scalability. The collected information is used to guide cache data placement, network configuration, coherence operations, address translation, memory configuration, and so on. Specifically, an approach is proposed to classify data blocks into different categories to optimize an on-chip coherent cache organization. For applications with compile-time deterministic data access localities, a compiler technique is proposed to determine data partitions that guide last-level cache data placement and the communication patterns for network configuration. A page-level data classification is also demonstrated to improve address translation performance. The successful use of data access characteristics on traditional CMP architectures demonstrates that the proposed approach is promising and generic and can potentially be applied to future CMP architectures with emerging technologies such as spin-transfer torque RAM (STT-RAM).
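    As a rough illustration of the page-level classification mentioned above, one can track, per page, which cores have touched it and whether any store occurred, yielding the private / shared-read-only / shared-read-write categories that placement, coherence, and translation policies can key on. The tracking granularity and the 64-core bitmap below are assumptions, not details from the dissertation.

```c
/* Toy page-level data classification: record accesses per page and derive
 * a category usable by placement/coherence/translation policies.
 * Assumes at most 64 cores (one bit per core in the sharer bitmap). */
#include <stdint.h>

enum page_class { PG_PRIVATE, PG_SHARED_RO, PG_SHARED_RW };

struct page_info {
    uint64_t sharers;        /* bitmap of cores that accessed the page */
    uint8_t  written;        /* any store seen so far? */
};

/* Record one access by `core` and return the page's current class. */
enum page_class classify_access(struct page_info *pi, int core, int is_write)
{
    pi->sharers |= 1ull << core;          /* core must be < 64 */
    pi->written |= (uint8_t)(is_write != 0);

    int shared = (pi->sharers & (pi->sharers - 1)) != 0;  /* >1 bit set */
    if (!shared)
        return PG_PRIVATE;
    return pi->written ? PG_SHARED_RW : PG_SHARED_RO;
}
```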