12 research outputs found

    Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems

    Abstract: In container-based virtualization, where multiple isolated containers share I/O resources on top of a single operating system, efficient and proportional I/O resource sharing is an important system requirement. Motivated by the lack of adequate support for I/O resource sharing in Linux Cgroup for high-performance NVMe SSDs, we developed a new weight-based dynamic throttling technique that provides proportional I/O sharing for container-based virtualization solutions running on NUMA multi-core systems with NVMe SSDs. By intelligently predicting the future I/O bandwidth requirements of containers from the past I/O service rates of I/O-active containers, and by modifying the current Linux Cgroup implementation for better NUMA-scalable performance, our scheme achieves highly accurate I/O resource sharing while reducing wasted I/O bandwidth. Based on a Linux kernel 4.0.4 implementation running on a 4-node NUMA multi-core system with NVMe SSDs, our experimental results show that the proposed technique can efficiently share the I/O bandwidth of NVMe SSDs among multiple containers according to their given I/O weights.
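    The core of the technique sketched above -- predicting each container's demand from past service rates and redistributing the unused portion of idle containers' weighted shares -- might look like this in miniature (the EWMA predictor and all names are illustrative assumptions, not the paper's actual implementation):

```python
def ewma(prev, sample, alpha=0.5):
    """Exponentially weighted moving average: a simple way to predict a
    container's future bandwidth demand from its past I/O service rates."""
    return alpha * sample + (1 - alpha) * prev

def throttle_budgets(weights, demand, total_bw):
    """Give each container a bandwidth budget proportional to its weight,
    then hand the surplus of under-demanding containers to the others,
    again in proportion to weight, so no bandwidth is wasted."""
    total_w = sum(weights.values())
    share = {c: total_bw * w / total_w for c, w in weights.items()}
    surplus = sum(max(0.0, share[c] - demand[c]) for c in weights)
    active_w = sum(w for c, w in weights.items() if demand[c] > share[c]) or 1
    return {c: demand[c] if demand[c] <= share[c]
            else share[c] + surplus * weights[c] / active_w
            for c in weights}
```

    With equal weights and 100 MB/s total, a container predicted to use only 25 MB/s keeps just that, and its 25 MB/s surplus flows to the busier container, which is the "reduced wasted I/O bandwidth" property the abstract claims.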

    TACKLING PERFORMANCE AND SECURITY ISSUES FOR CLOUD STORAGE SYSTEMS

    Building data-intensive applications and emerging computing paradigms (e.g., Machine Learning (ML), Artificial Intelligence (AI), and the Internet of Things (IoT)) in cloud computing environments is becoming the norm, given the many advantages in scalability, reliability, security, and performance. However, under rapid changes in applications, system middleware, and underlying storage devices, service providers face new challenges in delivering performance and security isolation when resources are shared among multiple tenants. The gap between the decades-old storage abstraction and modern storage devices keeps widening, calling for software/hardware co-designs to achieve more effective performance and security protocols. This dissertation rethinks the storage subsystem from the device level to the system level and proposes new designs at different levels to tackle performance and security issues in cloud storage systems. In the first part, we present an event-based SSD (Solid State Drive) simulator that models modern protocols, firmware, and the storage backend in detail. The proposed simulator can capture the nuances of SSD internal states under various I/O workloads, helping researchers understand the impact of SSD designs and workload characteristics on end-to-end performance. In the second part, we study the security challenges of shared in-storage computing infrastructures. Many cloud providers offer isolation at multiple levels to secure data and instances; however, security measures for emerging in-storage computing infrastructures have not been studied. We first investigate the attacks that could be conducted by offloaded in-storage programs in a multi-tenant cloud environment. To defend against these attacks, we build a lightweight Trusted Execution Environment, IceClave, to enable security isolation between in-storage programs and internal flash management functions.
    We show that while enforcing security isolation in the SSD controller with minimal hardware cost, IceClave retains the performance benefit of in-storage computing, delivering up to 2.4x better performance than the conventional host-based trusted computing approach. In the third part, we investigate the performance interference caused by other tenants' I/O flows. We demonstrate that I/O resource sharing can often lead to performance degradation and instability. The block device abstraction fails to expose SSD parallelism or pass application requirements down the stack. To this end, we propose a software/hardware co-design that enforces performance isolation by bridging this semantic gap. Our design significantly improves QoS (Quality of Service) by reducing throughput penalties and tail-latency spikes. Lastly, we explore more effective I/O control to address contention in the storage software stack. We illustrate that the state-of-the-art resource control mechanism, Linux cgroups, is insufficient for controlling I/O resources. Inappropriate cgroup configurations may even hurt the performance of co-located workloads under memory-intensive scenarios. We add kernel support for limiting page cache usage per cgroup and achieving I/O proportionality.
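    The final contribution above -- per-cgroup limits on page cache usage -- can be modeled in miniature as a cache that charges each page to its group and evicts within that group once its quota is reached (the class name, the LRU policy, and the interface are assumptions for illustration, not the kernel patch itself):

```python
from collections import OrderedDict

class QuotaPageCache:
    """Toy page cache that enforces a per-cgroup page quota with LRU
    eviction inside each group, so one group's working set cannot
    crowd out another's (illustrative, not the actual kernel design)."""
    def __init__(self, quotas):
        self.quotas = quotas                       # cgroup -> max pages
        self.pages = {g: OrderedDict() for g in quotas}

    def access(self, group, page):
        cache = self.pages[group]
        if page in cache:
            cache.move_to_end(page)                # LRU hit: refresh recency
            return True
        if len(cache) >= self.quotas[group]:
            cache.popitem(last=False)              # evict the group's own LRU page
        cache[page] = None
        return False
```

    The point of charging evictions to the offending group is exactly the isolation property the dissertation argues plain cgroups lack: a streaming tenant filling the cache only ever evicts its own pages.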

    I/O Schedulers for Proportionality and Stability on Flash-Based SSDs in Multi-Tenant Environments

    The use of flash-based Solid State Drives (SSDs) has expanded rapidly into the cloud computing environment. In cloud computing, ensuring the service level objective (SLO) of each server is the major criterion in designing a system. In particular, eliminating performance interference among virtual machines (VMs) on shared storage is a key challenge. However, studies on SSD performance to guarantee SLOs in such environments are limited. In this paper, we present an analysis of I/O behavior for a shared SSD used as storage, in terms of proportionality and stability. We show that the performance SLOs of SSD-based storage systems shared by VMs or tasks are not satisfactorily met. We present and analyze the reasons behind this unexpected behavior by examining SSD components such as channels, the DRAM buffer, and Native Command Queuing (NCQ). Based on our analysis and findings, we introduce two novel SSD-aware host-level I/O schedulers on Linux, called A+CFQ and H+BFQ. Through experiments on Linux, we analyze I/O proportionality and stability in multi-tenant environments. In addition, through experiments using real workloads, we analyze the performance interference between workloads on a shared SSD. We then show that the proposed I/O schedulers almost eliminate the interference effect seen in CFQ and BFQ, while still providing I/O proportionality and stability under various I/O-weighted scenarios.
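    The proportional-share behavior such schedulers target can be illustrated with a minimal start-time fair queuing model, in which each flow's requests are stamped with virtual finish times that advance inversely to the flow's weight (a generic textbook sketch of weighted fair queuing, not the actual A+CFQ/H+BFQ code):

```python
import heapq

class WeightedFairQueue:
    """Minimal weighted fair queuing: each flow's requests receive a
    virtual finish time advanced by size/weight, and the scheduler
    always dispatches the request with the smallest stamp, so service
    converges to the flows' weight ratios."""
    def __init__(self):
        self.vtime = {}       # flow -> last assigned virtual finish time
        self.heap = []        # (finish, flow, size) min-heap

    def submit(self, flow, weight, size):
        start = self.vtime.get(flow, 0.0)
        finish = start + size / weight     # heavier weight => slower clock
        self.vtime[flow] = finish
        heapq.heappush(self.heap, (finish, flow, size))

    def dispatch(self):
        finish, flow, size = heapq.heappop(self.heap)
        return flow, size
```

    With weights 2:1 and equal request sizes, flow "a" is dispatched twice for every dispatch of flow "b", which is the I/O proportionality property the paper evaluates.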

    Efficient I/O Management Techniques for Flash-Based High-Performance Computing Storage Systems

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Hyeonsang Eom. Most I/O traffic in high-performance computing (HPC) storage systems is dominated by checkpoints and restarts of HPC applications. For such bursty I/O, new all-flash HPC storage systems with an integrated burst buffer (BB) and parallel file system (PFS) have been proposed. However, most of the distributed file systems (DFSs) used to configure such storage systems provide a single connection between a compute node and a server node, which prevents users from exploiting the high I/O bandwidth an all-flash server node provides. To provide multiple connections, a DFS must be modified to increase the number of sockets, an extremely difficult and time-consuming task owing to its complicated structure. Users can instead increase the number of daemons in the DFS to forcibly increase the number of connections without modifying the DFS. But because each daemon has its own mount point for its connection, multiple mount points appear on the compute nodes, and users must expend significant effort distributing file I/O requests across them. In addition, to avoid access to a PFS composed of low-speed storage devices such as hard disks, dedicated BB allocation is preferred despite its severe underutilization. Such a BB allocation method may be inappropriate, however, because all-flash HPC storage systems speed up access to the PFS. To handle these problems, we propose an efficient, user-transparent I/O management scheme for all-flash HPC storage systems. The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from DFS developers or users. To do so, we modified the mount procedure and I/O processing procedures in the virtual file system (VFS). In the second scheme, data management between the BB and PFS, a BB over-subscription allocation method is adopted to improve BB utilization.
    Unfortunately, this allocation method aggravates I/O interference and the demotion overhead from the BB to the PFS, degrading checkpoint and restart performance. To minimize this degradation, we developed an I/O scheduler and a new data management policy based on checkpoint and restart characteristics. To prove the effectiveness of the proposed schemes, we evaluated both the I/O transfer management and the data management between the BB and PFS. The I/O transfer management scheme improves write and read I/O throughput for checkpoint and restart by up to 6x and 3x, respectively, over a DFS using the original kernel. With the data management scheme, BB utilization improves by at least 2.2x, and a more stable and higher checkpoint performance is guaranteed. In addition, we achieved up to a 96.4% hit ratio for restart requests on the BB and up to 3.1x higher restart performance than existing methods.
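    The difference between dedicated and over-subscribed BB allocation can be illustrated with a toy admission model in which jobs reserve their peak checkpoint size but typically write much less (the function and the job model are hypothetical, not the dissertation's actual allocator):

```python
def bb_utilization(capacity, jobs, factor=1.0):
    """Admit jobs while the sum of reserved peaks stays within
    capacity * factor; return (jobs admitted, fraction of the real
    capacity actually written). factor=1.0 models dedicated allocation,
    factor > 1.0 models over-subscription."""
    admitted, committed, written = 0, 0, 0
    for peak, actual in jobs:          # (reserved peak, bytes actually written)
        if committed + peak <= capacity * factor:
            committed += peak
            written += actual
            admitted += 1
    return admitted, written / capacity
```

    For example, with a 100-unit buffer and three jobs that each reserve 50 units but write only 10, dedicated allocation admits two jobs at 20% utilization, while a 2x over-subscription factor admits all three at 30%; the cost is the extra interference and demotion traffic that the proposed scheduler and data management policy then have to absorb.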

    A Performance Analysis of Key-Value Storage Engines for Environments with Shared Persistent Storage Resources

    Advisor: Marcos Sfair Sunye. Doctoral thesis -- Universidade Federal do Paraná, Setor de Ciências Exatas, Graduate Program in Informatics. Defended: Curitiba, 31/05/2023. Includes references: p. 92-98. Concentration area: Computer Science. Abstract: Key-value stores today hold an important place in numerous information technology segments, ranging from the minimalist world of mobile devices and the Internet of Things to highly complex scientific applications and Big Data. Essential for persisting data, storage technology has also shown substantial improvements over recent years with the progressive replacement of old hard disk drives by flash-based storage devices in most private and public cloud computing infrastructures. Despite their increased adoption and superior performance, flash-based storage devices usually comprise complex hardware and firmware logic, with many details not revealed by their manufacturers. This complexity and lack of information about their internal behavior not only hinder a proper performance evaluation of key-value stores but also make this task even more complex in scenarios where such resources are shared with other co-located workloads. Since resource sharing is increasingly common in cloud environments, evaluating key-value stores in such a context is as challenging as it is necessary. This study proposes Storiks, a framework designed to assess how a key-value store's performance is affected by concurrent workloads sharing the same flash-based storage device. Using this framework, we experimentally evaluated 840 combinations of workloads, storage devices, and operating system versions.
    We demonstrate from these results that such interference may assume distinct patterns according to each of these factors, ranging from approximately no interference to highly degrading conditions in which the key-value store's performance is reduced by more than 90%.

    IMPROVING THE PERFORMANCE OF HYBRID MAIN MEMORY THROUGH SYSTEM AWARE MANAGEMENT OF HETEROGENEOUS RESOURCES

    Modern computer systems feature memory hierarchies that typically include DRAM as the main memory and HDDs as secondary storage. DRAM and HDDs have been used extensively for the past several decades because of their high performance and low cost per bit at their level of the hierarchy. Unfortunately, DRAM faces serious scaling and power-consumption problems, while HDDs have suffered from stagnant performance improvement and poor energy efficiency. Computer system architects thus share an implicit consensus that future systems' performance and power consumption cannot improve unless something fundamentally changes. To address the looming problems with DRAM and HDDs, emerging Non-Volatile RAMs (NVRAMs) such as Phase Change Memory (PCM) and Spin-Transfer-Torque Magnetoresistive RAM (STT-MRAM) have been actively explored as new media for the future memory hierarchy. However, since these NVRAMs have quite different characteristics from DRAM and HDDs, integrating them into a conventional memory hierarchy requires significant architectural reconsideration and imposes additional, complicated design trade-offs. This work assumes a future system in which both main memory and secondary storage include NVRAMs placed on the same memory bus. In this organization, this dissertation addresses the efficient exploitation of NVRAMs and DRAM integrated into a future platform's memory hierarchy. In particular, it investigates the system performance and lifetime improvement enabled by a novel system architecture called Memorage, which co-manages all available physical NVRAM resources for main memory and storage at the system level. The work also studies the impact of a model-guided, hardware-driven page swap in a hybrid main memory on application performance.
    Together, the two ideas enable a future system to mitigate severe performance degradation under heavy memory pressure and to avoid inefficient use of DRAM capacity due to injudicious page-swap decisions. In summary, this research not only demonstrates how emerging NVRAMs can be effectively employed and integrated to enhance the performance and endurance of a future system, but also helps system architects understand important design trade-offs for NVRAM-based memory and storage systems.
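    A model-guided page swap of the kind studied here boils down to ranking pages by observed hotness and moving the hottest ones into the fast tier; a toy planner might look like this (the names and the simple count-based ranking are assumptions for illustration, not the dissertation's hardware-driven model):

```python
from collections import Counter

def plan_migrations(accesses, dram_resident, dram_slots):
    """Rank pages by access count, promote the hottest non-DRAM pages
    into DRAM, and demote DRAM pages that fell out of the hot set
    (a toy model of hotness-guided page swap in a DRAM/NVRAM hybrid)."""
    heat = Counter(accesses)
    ranked = sorted(heat, key=heat.get, reverse=True)
    target = set(ranked[:dram_slots])          # pages that deserve DRAM
    promote = [p for p in ranked[:dram_slots] if p not in dram_resident]
    demote = sorted(p for p in dram_resident if p not in target)
    return promote, demote
```

    An injudicious policy (e.g., promoting on a single access) would churn DRAM capacity on cold pages; ranking by accumulated heat is the simplest guard against exactly the wasteful swaps the dissertation's model-guided approach aims to avoid.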

    Memory Subsystem Optimization for Efficient System Resource Utilization by Data-Intensive Applications

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Heon Young Yeom. With explosive data growth, data-intensive applications such as relational databases and key-value stores have become increasingly popular across a variety of domains in recent years. To meet their growing performance demands, it is crucial to efficiently and fully utilize memory resources. However, general-purpose operating systems (OSs) are designed to provide system resources fairly to all applications running on a system. A single application may therefore find it difficult to exploit the system's best performance due to this system-level fairness. For performance reasons, many data-intensive applications reimplement mechanisms that OSs already provide, under the assumption that they know their data better than the OS does. Such mechanisms can be greedily optimized for performance, but this may result in inefficient use of system resources. In this dissertation, we claim that simple OS support with minor application modifications can yield even higher application performance without sacrificing system-level resource utilization. We optimize and extend the OS memory subsystem to better support applications while addressing three memory-related issues in data-intensive applications. First, we introduce a memory-efficient cooperative caching approach between application and kernel buffers to address the double-caching problem, in which the same data resides in multiple layers. Second, we present a memory-efficient, transparent zero-copy read I/O scheme to avoid the performance interference caused by memory copies during I/O. Third, we propose a memory-efficient fork-based checkpointing mechanism for in-memory database systems to mitigate the memory-footprint problem of the existing fork-based checkpointing scheme, whose memory usage can grow incrementally (up to 2x) during checkpointing under update-intensive workloads.
    To show the effectiveness of our approach, we implement and evaluate our schemes on real multi-core systems. The experimental results demonstrate that our cooperative approach addresses the above issues more effectively than existing non-cooperative approaches while delivering better performance (in terms of transaction processing speed, I/O throughput, and memory footprint).
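    The fork-based checkpointing baseline that the third scheme improves can be sketched directly with os.fork: the child serializes a copy-on-write snapshot frozen at fork time while the parent keeps mutating -- which is also why memory usage grows (up to 2x) as the parent dirties pages during an update-intensive checkpoint (a POSIX-only sketch with illustrative names, not the dissertation's mechanism):

```python
import json
import os
import tempfile

def checkpoint(db, path):
    """Fork-based snapshot: the child inherits a copy-on-write view of
    the parent's memory frozen at fork time and serializes it, while
    the parent returns immediately and keeps serving updates."""
    pid = os.fork()
    if pid == 0:                            # child: persist the frozen state
        with open(path, 'w') as f:
            json.dump(db, f)
        os._exit(0)
    return pid                              # parent: continue without waiting

db = {'k': 1}
path = tempfile.NamedTemporaryFile(delete=False).name
pid = checkpoint(db, path)
db['k'] = 2                                 # parent mutates after the snapshot
os.waitpid(pid, 0)
with open(path) as f:
    snap = json.load(f)                     # reflects the fork-time state
```

    The snapshot on disk holds the fork-time value even though the parent updated the key concurrently; every such update, however, forces the kernel to duplicate the underlying page, which is the footprint growth the proposed scheme mitigates.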

    PRISMA: a prefetching storage middleware for accelerating deep learning frameworks

    Integrated master's dissertation in Informatics Engineering. Deep Learning (DL) is a widely used technique applied to many domains, from computer vision to natural language processing. To avoid overfitting, DL applications have to access large amounts of data, which affects training performance. Although significant hardware advances have been made, current storage systems cannot keep up with the demands of DL techniques. Accordingly, multiple storage solutions have been developed to improve the Input/Output (I/O) performance of DL training. Nevertheless, they are either specific to certain DL frameworks or present drawbacks, such as loss of accuracy. Most DL frameworks also contain internal I/O optimizations, but these cannot easily be decoupled and applied to other frameworks. Furthermore, most of these optimizations must be manually configured or rely on greedy provisioning algorithms that waste computational resources. To address these issues, we propose PRISMA, a novel storage middleware that employs data prefetching and parallel I/O to improve DL training performance. PRISMA provides an auto-tuning mechanism to automatically select the optimal configuration, designed to achieve a good trade-off between performance and resource usage. PRISMA is framework-agnostic, meaning it can be applied to any DL framework, and does not impact the accuracy of the training model. In addition to PRISMA, we provide a thorough study and evaluation of the TensorFlow Dataset Application Programming Interface (API), demonstrating that local DL can benefit from I/O optimization. PRISMA was integrated and evaluated with two popular DL frameworks, namely TensorFlow and PyTorch, proving that it is successful under different I/O workloads.
    Experimental results demonstrate that PRISMA is the most efficient solution for the majority of the scenarios studied, while in the remaining scenarios it exhibits performance similar to the built-in optimizations of TensorFlow and PyTorch. Fundação para a Ciência e a Tecnologia (FCT) - project UIDB/50014/202
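    The prefetching idea at PRISMA's core -- a background reader keeping a bounded buffer of samples ahead of the consumer so storage I/O overlaps with training computation -- can be sketched as follows (a generic illustration, not PRISMA's actual API; the fixed queue depth stands in for its auto-tuned configuration):

```python
import queue
import threading

class Prefetcher:
    """Bounded background prefetcher: a producer thread loads items
    ahead of the consumer so storage I/O overlaps with computation,
    while the queue's maxsize caps memory use."""
    def __init__(self, load, keys, depth=4):
        self.q = queue.Queue(maxsize=depth)     # bounds buffered samples
        self.n = len(keys)
        worker = threading.Thread(
            target=lambda: [self.q.put(load(k)) for k in keys], daemon=True)
        worker.start()                          # starts loading immediately

    def __iter__(self):
        for _ in range(self.n):
            yield self.q.get()                  # blocks only if not yet prefetched
```

    The depth parameter is the trade-off PRISMA's auto-tuner navigates: too shallow and the consumer stalls on I/O, too deep and memory is wasted on samples the trainer will not need for a while.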