
    Reducing consistency traffic and cache misses in the Avalanche multiprocessor

    Journal article. For a parallel architecture to scale effectively, communication latency between processors must be minimized. We have found that a large number of avoidable cache misses stem from hardwired write-invalidate coherency protocols, which often exhibit high miss rates due to excessive invalidations and the subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer.
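
    The novelty here is the write state buffer. The C sketch below is a minimal illustration of the idea, not Avalanche's hardware: between an acquire and a release, per-word dirty bits record which words of a line the local processor wrote, and at release only those words are propagated to sharers instead of invalidating the whole line. The line size, entry layout, and names such as wsb_record_write are assumptions made for illustration.

        /* Illustrative write state buffer (WSB) entry: per-word dirty bits
         * track local writes between an acquire and a release; on release,
         * only the dirty words are sent as updates to sharers instead of
         * invalidating the whole line. Sizes and names are assumptions. */
        #include <stdint.h>
        #include <stdio.h>

        #define WORDS_PER_LINE 8            /* assume a 32-byte line of 4-byte words */

        struct wsb_entry {
            uint64_t line_addr;             /* line-aligned address */
            uint8_t  dirty;                 /* one bit per locally written word */
            uint32_t data[WORDS_PER_LINE];  /* locally written values */
        };

        static void wsb_record_write(struct wsb_entry *e, int word, uint32_t val)
        {
            e->dirty |= (uint8_t)(1u << word);
            e->data[word] = val;
        }

        /* At release time: propagate only the dirty words, then clear. */
        static void wsb_flush_on_release(struct wsb_entry *e)
        {
            for (int w = 0; w < WORDS_PER_LINE; w++)
                if (e->dirty & (1u << w))
                    printf("update line %#llx word %d = %u\n",
                           (unsigned long long)e->line_addr, w, e->data[w]);
            e->dirty = 0;
        }

        int main(void)
        {
            struct wsb_entry e = { .line_addr = 0x1000, .dirty = 0 };
            wsb_record_write(&e, 2, 42);   /* two local writes between acquire/release */
            wsb_record_write(&e, 5, 7);
            wsb_flush_on_release(&e);      /* sends exactly two word updates */
            return 0;
        }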

    A Study of Client-based Caching for Parallel I/O

    The trend in parallel computing toward large-scale cluster computers running thousands of cooperating processes per application has led to an I/O bottleneck that has only grown more severe as the number of processing cores per CPU has increased. Current parallel file systems provide high-bandwidth file access for large contiguous file regions; however, applications that repeatedly access small file regions on unaligned boundaries continue to experience poor I/O throughput due to the high overhead of accessing parallel file system data. In this dissertation we demonstrate how client-side file data caching can improve parallel file system throughput for applications performing frequent small and unaligned file I/O. We explore the impacts of cache page size and cache capacity using the popular FLASH I/O benchmark, and we explore a novel cache sharing approach that leverages the trend toward multi-core processors. We also explore a technique we call progressive page caching that represents cache data using dynamic data structures rather than fixed-size pages of file data. Finally, we explore a cache aggregation scheme that leverages the high-level file I/O interfaces provided by the PVFS file system to provide further performance enhancements. In summary, our results indicate that a correctly configured middleware-based file data cache can dramatically improve the performance of I/O workloads dominated by small unaligned file accesses. Further, we demonstrate that a well-designed cache can offer stable performance even when the selected cache page granularity is not well matched to the provided workload. Finally, we show that high-level file system interfaces can significantly accelerate application performance, and that interfaces beyond those currently envisioned by the MPI-IO standard could provide further performance benefits.
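
    As a rough illustration of the client-side caching being evaluated, the C sketch below services small, unaligned reads from fixed-size cached pages so that only whole-page misses reach the file system. The direct-mapped table, page size, and the placeholder fs_read_page are illustrative assumptions, not the dissertation's middleware or PVFS's API.

        /* Minimal client-side page cache: repeated small, unaligned reads
         * are served from cached fixed-size pages, so only whole-page
         * misses pay the parallel file system round trip. */
        #include <stdint.h>
        #include <string.h>
        #include <stdio.h>

        #define PAGE_SIZE  4096
        #define NUM_PAGES  64

        struct cache_page {
            int64_t index;               /* page number, -1 if empty */
            char    data[PAGE_SIZE];
        };

        static struct cache_page cache[NUM_PAGES];

        /* Hypothetical backing-store fetch, standing in for a PFS request. */
        static void fs_read_page(int64_t index, char *buf)
        {
            (void)index;
            memset(buf, 0, PAGE_SIZE);
        }

        /* Serve a small, possibly unaligned read from the cache. */
        static void cached_read(int64_t offset, size_t len, char *out)
        {
            while (len > 0) {
                int64_t idx = offset / PAGE_SIZE;
                size_t  off = (size_t)(offset % PAGE_SIZE);
                size_t  n   = len < PAGE_SIZE - off ? len : PAGE_SIZE - off;
                struct cache_page *p = &cache[idx % NUM_PAGES]; /* direct-mapped */

                if (p->index != idx) {   /* miss: fetch the whole page once */
                    fs_read_page(idx, p->data);
                    p->index = idx;
                }
                memcpy(out, p->data + off, n);
                offset += n; out += n; len -= n;
            }
        }

        int main(void)
        {
            for (int i = 0; i < NUM_PAGES; i++) cache[i].index = -1;
            char buf[100];
            cached_read(5000, sizeof buf, buf);  /* unaligned small read: one miss */
            cached_read(5050, 10, buf);          /* served entirely from cache */
            puts("done");
            return 0;
        }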

    Improving Parallel I/O Performance Using Interval I/O

    Today's most advanced scientific applications run on large clusters consisting of hundreds of thousands of processing cores, access state-of-the-art parallel file systems that allow files to be distributed across hundreds of storage targets, and utilize advanced interconnection systems that offer theoretical I/O bandwidth of hundreds of gigabytes per second. Despite these advanced technologies, these applications often fail to obtain a reasonable proportion of the available I/O bandwidth. The reasons for the poor performance of application I/O include the noncontiguous I/O access patterns used in scientific computing, contention due to false sharing, and the somewhat finicky nature of parallel file system performance. We argue that a more fundamental cause of this problem is the legacy view of a file as a linear sequence of bytes. To address these issues, we introduce a novel approach for parallel I/O called Interval I/O. Interval I/O uses application access patterns to partition a file into a series of intervals, which are then used as the fundamental unit for subsequent I/O operations. This approach provides superior performance for the noncontiguous access patterns frequently used by scientific applications. In addition, it reduces false contention and the unnecessary serialization it causes. Interval I/O also significantly increases the performance of atomic mode operations. Finally, the Interval I/O approach includes a technique for supporting parallel I/O for cooperating applications. We provide a prototype implementation of our Interval I/O system and use it to demonstrate performance improvements of as much as 1000% compared to ROMIO on several common benchmarks.
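
    A minimal sketch of the central data structure this approach implies: a sorted table of intervals derived from access patterns, with per-interval state so that an operation touching one interval need not serialize against neighbors that merely share a fixed-size block. The struct layout and find_interval below are illustrative assumptions, not the prototype's implementation.

        /* Sketch of the interval idea: the byte stream is partitioned into
         * intervals derived from observed access patterns, and each I/O
         * operation locks only the interval it touches, reducing the false
         * sharing that fixed block boundaries cause. Illustrative only. */
        #include <stdint.h>
        #include <stdio.h>

        struct interval {
            int64_t start, end;    /* [start, end) in file offsets */
            int     locked;        /* stand-in for a per-interval lock */
        };

        /* Binary-search the sorted table for the interval holding off. */
        static struct interval *find_interval(struct interval *tab, int n,
                                              int64_t off)
        {
            int lo = 0, hi = n - 1;
            while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (off < tab[mid].start)      hi = mid - 1;
                else if (off >= tab[mid].end)  lo = mid + 1;
                else return &tab[mid];
            }
            return NULL;
        }

        int main(void)
        {
            /* Intervals as they might be built from two writers' patterns. */
            struct interval tab[] = { {0, 1000, 0}, {1000, 1024, 0}, {1024, 4096, 0} };
            struct interval *iv = find_interval(tab, 3, 1010);
            if (iv)
                printf("offset 1010 -> interval [%lld, %lld)\n",
                       (long long)iv->start, (long long)iv->end);
            return 0;
        }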

    데이터 집약적 μ‘μš©μ˜ 효율적인 μ‹œμŠ€ν…œ μžμ› ν™œμš©μ„ μœ„ν•œ λ©”λͺ¨λ¦¬ μ„œλΈŒμ‹œμŠ€ν…œ μ΅œμ ν™”

    Doctoral dissertation, Seoul National University, Department of Electrical and Computer Engineering, August 2020. Advisor: Heon Y. Yeom.
    With explosive data growth, data-intensive applications such as relational databases and key-value storage have become increasingly popular in a variety of domains in recent years. To meet the growing performance demands of data-intensive applications, it is crucial to use memory resources efficiently and fully for the best possible performance. However, general-purpose operating systems (OSs) are designed to provide system resources fairly, at the system level, to all applications running on a system. A single application may therefore find it difficult to fully exploit the system's best performance due to this system-level fairness. For performance reasons, many data-intensive applications implement their own versions of mechanisms that OSs already provide, under the assumption that they understand their data better than the OS does. Such mechanisms can be greedily optimized for performance, but this may result in inefficient use of system resources. In this dissertation, we claim that simple OS support combined with minor application modifications can yield even higher application performance without sacrificing system-level resource utilization. We optimize and extend the OS memory subsystem to better support applications while addressing three memory-related issues in data-intensive applications. First, we introduce a memory-efficient cooperative caching approach between application and kernel buffers to address the double caching problem, where the same data resides in multiple layers. Second, we present a memory-efficient, transparent zero-copy read I/O scheme to avoid the performance interference caused by memory copying during I/O. Third, we propose a memory-efficient fork-based checkpointing mechanism for in-memory database systems to mitigate the memory footprint problem of the existing fork-based checkpointing scheme, under which memory usage can grow incrementally to as much as 2x during checkpointing for update-intensive workloads. To show the effectiveness of our approach, we implement and evaluate our schemes on real multi-core systems. The experimental results demonstrate that our cooperative approach addresses the above issues more effectively than existing non-cooperative approaches while delivering better performance in terms of transaction processing speed, I/O throughput, or memory footprint.
    Outline: Chapter 1, Introduction; Chapter 2, Background (Linux kernel memory management; Linux support for applications); Chapter 3, Memory Efficient Cooperative Caching; Chapter 4, Memory Efficient Zero-copy I/O; Chapter 5, Memory Efficient Fork-based Checkpointing; Chapter 6, Conclusion.
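
    For context on the third contribution, the sketch below shows the conventional fork-based checkpointing baseline that the dissertation improves on, not the proposed mechanism: fork() hands the child a copy-on-write snapshot to write out while the parent keeps serving updates, and every page the parent dirties during the checkpoint is duplicated by CoW, which is why memory usage can approach 2x under update-intensive workloads.

        /* Conventional fork-based checkpointing: the child sees a frozen
         * CoW snapshot of the database image and writes it out while the
         * parent keeps mutating its copy. Each page the parent dirties
         * during the checkpoint costs an extra physical page. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/wait.h>

        #define DB_BYTES (1 << 20)

        int main(void)
        {
            char *db = malloc(DB_BYTES);           /* in-memory database image */
            if (!db) return 1;
            memset(db, 'A', DB_BYTES);

            pid_t pid = fork();
            if (pid == 0) {                        /* child: frozen CoW snapshot */
                FILE *f = fopen("checkpoint.img", "wb");
                if (!f) _exit(1);
                fwrite(db, 1, DB_BYTES, f);
                fclose(f);
                _exit(0);
            }

            /* parent: each touched page triggers a CoW duplication */
            for (int i = 0; i < DB_BYTES; i += 4096)
                db[i] = 'B';

            waitpid(pid, NULL, 0);
            free(db);
            return 0;
        }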

    Avalanche: A communication and memory architecture for scalable parallel computing

    Technical report. As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not to a processor's top-level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context-sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end-to-end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data under write-invalidate coherence protocols, and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%.

    2OS

    In this book I approach the problem of understanding an OS from the point of view of a C programmer who needs to understand enough of how an OS works to program efficiently and to avoid the traps and pitfalls that arise from not understanding what is happening underneath you. If you have a deep understanding of the memory system, you will not program in a style that loses significant performance by breaking the assumptions of the OS designer. If you understand how I/O works, you can make good use of OS services. As you work through this book you will see other examples.
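
    In the spirit of the book's memory-system advice, a standard example of code that respects, or breaks, the designers' assumptions: C stores arrays row-major, so the first loop nest below walks memory with unit stride and stays cache-friendly, while the second strides by a whole row per access and incurs far more cache misses for the same arithmetic.

        /* Same computation, two traversal orders. The i-j order touches
         * consecutive addresses (unit stride); the j-i order jumps N
         * doubles per access, defeating caching and prefetching. */
        #include <stdio.h>

        #define N 2048

        int main(void)
        {
            static double a[N][N];             /* zero-initialized, ~32 MB */
            double sum = 0.0;

            for (int i = 0; i < N; i++)        /* cache-friendly: unit stride */
                for (int j = 0; j < N; j++)
                    sum += a[i][j];

            for (int j = 0; j < N; j++)        /* cache-hostile: stride of N */
                for (int i = 0; i < N; i++)
                    sum += a[i][j];

            printf("%f\n", sum);
            return 0;
        }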

    Towards Successful Application of Phase Change Memories: Addressing Challenges from Write Operations

    The emerging Phase Change Memory (PCM) technology is drawing increasing attention due to its advantages in non-volatility, byte-addressability, and scalability. It is regarded as a promising candidate for future main memory. However, PCM's write operation has limitations that pose challenges to its application in memory: long write latency, high write power, and limited write endurance. In this thesis, I present my efforts toward the successful application of PCM memory. My research consists of several optimization techniques at both the circuit and architecture levels. First, at the circuit level, I propose Differential Write to remove unnecessary bit changes in PCM writes. This benefits not only endurance but also the energy and latency of writes. Second, I propose two memory scheduling enhancements (AWP and RAWP) for a non-blocking bank design. These enhancements exploit the intra-bank parallelism provided by the non-blocking bank design and achieve significant throughput improvement. Third, I propose Bit Level Power Budgeting (BPB), a fine-grained power budgeting technique that leverages the information from Differential Write to achieve even higher memory throughput under the same power budget. Fourth, I propose techniques to improve the QoS tuning ability of high-priority applications running on PCM memory. In summary, the techniques I propose effectively address the challenges of PCM's write operations. In addition, I present the experimental infrastructure used in this work and my vision of potential future research topics, which may be helpful to other researchers in the area.
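
    The circuit-level idea behind Differential Write can be conveyed in a few lines: compare the new data with the cell's current contents and program only the bits that actually differ. The 64-bit word granularity and the GCC/Clang popcount builtin below are illustrative choices, not the thesis's circuit design.

        /* Differential Write sketch: XOR old and new contents; the 1-bits
         * mark the only positions that must be programmed, which is what
         * saves write energy and endurance. */
        #include <stdint.h>
        #include <stdio.h>

        /* Returns how many bits actually had to be written. */
        static int differential_write(uint64_t *cell, uint64_t newval)
        {
            uint64_t diff = *cell ^ newval;   /* 1-bits = positions that change */
            int flipped = __builtin_popcountll(diff);  /* GCC/Clang builtin */
            *cell = newval;                   /* only 'flipped' bits really cost */
            return flipped;
        }

        int main(void)
        {
            uint64_t cell = 0xFFFF0000FFFF0000ull;
            int n = differential_write(&cell, 0xFFFF0000FFFF00FFull);
            printf("bits actually written: %d of 64\n", n);  /* prints 8 */
            return 0;
        }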

    Rethinking the I/O Stack for Persistent Memory

    Modern operating systems have been designed around the hypotheses that (a) memory is both byte-addressable and volatile and (b) storage is block-addressable and persistent. The arrival of new Persistent Memory (PM) technologies has made these assumptions obsolete. Despite much recent work in this space, the need to consistently share PM data across multiple applications remains an urgent, unsolved problem, and simple yet powerful operating system support remains elusive. In this dissertation, we propose and build the Region System, a high-performance operating system stack for PM that implements usable consistency and persistence for application data. The region system provides support for consistently mapping and sharing data resident in PM across user application address spaces. The region system introduces a novel IPI-based PMSYNC operation, which ensures atomic persistence of mapped pages across multiple address spaces. This allows applications to consume PM using the well-understood and much-desired memory-like model with an easy-to-use interface. Next, we propose a metadata structure without any redundant metadata to reduce CPU cache flushes. The high-performance design minimizes the expensive PM ordering and durability operations by embracing a minimalistic approach to metadata construction and management. To strengthen the case for the region system, we analyze different types of applications to identify their dependence on memory-mapped data usage, and we propose the user-level libraries LIBPM-R and LIBPMEMOBJ-R to support shared persistent containers. The user-level libraries, together with the region system, demonstrate a comprehensive end-to-end software stack for consuming PM devices.
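
    As a point of reference for PMSYNC, the sketch below shows the familiar single-process baseline it generalizes: map a persistent region into the address space, update it in place with ordinary stores, and make the update durable with msync. The region system's contribution, atomic persistence across multiple address spaces, has no portable equivalent; the file name and sizes here are illustrative.

        /* Single-process persistence baseline: mmap a file-backed region,
         * mutate it byte-addressably, then force durability with msync. */
        #include <fcntl.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define REGION_BYTES 4096

        int main(void)
        {
            int fd = open("region.dat", O_RDWR | O_CREAT, 0644);
            if (fd < 0 || ftruncate(fd, REGION_BYTES) < 0) return 1;

            char *region = mmap(NULL, REGION_BYTES, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
            if (region == MAP_FAILED) return 1;

            strcpy(region, "persistent record v1");   /* byte-addressable update */
            msync(region, REGION_BYTES, MS_SYNC);     /* make it durable */

            munmap(region, REGION_BYTES);
            close(fd);
            return 0;
        }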