Search CORE

6 research outputs found

Assise: Performance and Availability via NVM Colocation in a Distributed File System

Author: Anderson Thomas E.
Canini Marco
Kim Jongyul
Kostić Dejan
Kwon Youngjin
Peter Simon
Reda Waleed
Schuh Henry N.
Witchel Emmett
Publication venue
Publication date: 01/01/2020
Field of study

The adoption of very low latency persistent memory modules (PMMs) upends the long-established model of disaggregated file system access. Instead, by colocating computation and PMM storage, we can provide applications much higher I/O performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol for managing a set of server-colocated PMMs as a fast, crash-recoverable cache between applications and slower disaggregated storage, such as SSDs. Unlike disaggregated file systems, Assise maximizes locality for all file IO by carrying out IO on colocated PMM whenever possible and minimizes coherence overhead by maintaining consistency at IO operation granularity, rather than at fixed block sizes. We compare Assise to Ceph/Bluestore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs for common cloud applications and benchmarks, such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency up to 22x, throughput up to 56x, fail-over time up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics. Assise promises to beat the MinuteSort world record by 1.5x

arXiv.org e-Print Archive

Publikationer från KTH

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Recommended from our members

Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks

Author: Yang Jian
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers.Unfortunately, existing approaches of providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. Such assumption no longer holds for the fast NVMM. As a result, taking full advantage of NVMMs’ potential will require changes in system software and networking protocol. This thesis focuses on accessing remote NVMM efficiently using remote direct memory access (RDMA) network. RDMA enables a client to directly access memory on a remote machine without involving its local CPU.This thesis first presents Mojim, a system that provides replicated, reliable, and highly-available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.This thesis then presents Orion, a distributed file system designed from for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives. These slower media incentivizes complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer. It provides low latency metadata accesses and outperforms existing distributed file systems by a large margin.Finally, an NVMM application can map files backed by an NVMM file system into its address space, and accesses them using CPU instructions. In this case, RDMA and NVMM file systems introduce duplication of effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC’s translation cache and resulting in application performance improvement by 1.8x - 2.0x

eScholarship - University of California

Optimisation des caches de fichiers dans les environnements virtualisés

Author: Todeschi Grégoire
Publication venue
Publication date: 08/06/2020
Field of study

Les besoins en ressources de calcul sont en forte augmentation depuis plusieurs décennies, que ce soit pour des applications du domaine des réseaux sociaux, du calcul haute performance, ou du big data. Les entreprises se tournent alors vers des solutions d'externalisation de leurs services informatiques comme le Cloud Computing. Le Cloud Computing permet une mutalisation des ressources informatiques dans un datacenter et repose généralement sur la virtualisation. Cette dernière permet de décomposer une machine physique, appelée hôte, en plusieurs machines virtuelles (VM) invitées. La virtualisation engendre de nouveaux défis dans la conception des systèmes d'exploitation, en particulier pour la gestion de la mémoire. La mémoire est souvent utilisée pour accélérer les coûteux accès aux disques, en conservant ou préchargeant les données du disque dans le cache fichiers. Seulement la mémoire est une ressource limitée et limitante pour les environnements virtualisés, affectant ainsi les performances des applications utilisateurs. Il est alors nécessaire d'optimiser l'utilisation du cache de fichiers dans ces environnements. Dans cette thèse, nous proposons deux approches orthogonales pour améliorer les performances des applications à l'aide d'une meilleure utilisation du cache fichiers. Dans les environnements virtualisés, hôte et invités exécutent chacun leur propre système d'exploitation (OS) et ont donc chacun un cache de fichiers. Lors de la lecture d'un fichier, les données se retrouvent présentes dans les deux caches. Seulement, les deux OS exploitent la même mémoire physique. On parle de duplication des pages du cache. La première contribution vise à pallier ce problème avec Cacol, une politique d'éviction de cache s'exécutant dans l'hôte et non intrusive vis-à-vis de la VM. Cacol évite ces doublons de pages réduisant ainsi l'utilisation de la mémoire d'une machine physique. La seconde approche est d'étendre le cache fichiers des VM en exploitant de la mémoire disponible sur d'autres machines du datacenter. Cette seconde contribution, appelée Infinicache, s'appuie sur Infiniband, un réseau RDMA à haute vitesse, et exploite sa capacité à lire et à écrire sur de la mémoire à distance. Directement implémenté dans le cache invité, Infinicache stocke les pages évincées de son cache sur de la mémoire à distance. Les futurs accès à ces pages sont alors plus rapides que des accès aux disques de stockage, améliorant par conséquent les performances des applications. De plus, le taux d'utilisation de la mémoire à l'échelle du datacenter est augmenté, réduisant le gaspillage de manière globale

Thèses en Ligne

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

Incast mitigation in a data center storage cluster through a dynamic fair-share buffer policy

Author: Abbas Haider
Bangash Yawar Abbas
Imran Muhammad Ali
Khan Adnan Ahmed
Rana Tauseef
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

Incast is a phenomenon when multiple devices interact with only one device at a given time. Multiple storage senders overflow either the switch buffer or the single-receiver memory. This pattern causes all concurrent-senders to stop and wait for buffer/memory availability, and leads to a packet loss and retransmission—resulting in a huge latency. We present a software-defined technique tackling the many-to-one communication pattern—Incast—in a data center storage cluster. Our proposed method decouples the default TCP windowing mechanism from all storage servers, and delegates it to the software-defined storage controller. The proposed method removes the TCP saw-tooth behavior, provides a global flow awareness, and implements the dynamic fair-share buffer policy for end-to-end I/O path. It considers all I/O stages (applications, device drivers, NICs, switches/routers, file systems, I/O schedulers, main memory, and physical disks) while achieving the maximum I/O throughput. The policy, which is part of the proposed method, allocates fair-share bandwidth utilization for all storage servers. Priority queues are incorporated to handle the most important data flows. In addition, the proposed method provides better manageability and maintainability compared with traditional storage networks, where data plane and control plane reside in the same device

Enlighten

ACCELERATING STORAGE APPLICATIONS WITH EMERGING KEY VALUE STORAGE DEVICES

Author: Qin Mian
Publication venue
Publication date: 24/01/2022
Field of study

With the continuous data explosion in the big data era, traditional software and hardware stack are facing unprecedented challenges on how to operate on such data scale. Thus, designing new architectures and efficient systems for data oriented applications has become increasingly critical. This motivates us to re-think of the conventional storage system design and re-architect both software and hardware to meet the challenges of scale. Besides the fast growth of data volume, the increasing demand on storage applications such as video streaming, data analytics are pushing high performance flash based storage devices to replace the traditional spinning disks. Such all-flash era increase the data reliability concerns due to the endurance problem of flash devices. Key-value stores (KVS) are important storage infrastructure to handle the fast growing unstructured data and have been widely deployed in a variety of scale-out enterprise applications such as online retail, big data analytic, social networks, etc. How to efficiently manage data redundancy for key-value stores to provide data reliability, how to efficiently support range query for key-value stores to accelerate analytic oriented applications under emerging key-value store system architecture become an important research problem. In this research, we focus on how to design new software hardware architectures for the keyvalue store applications to provide reliability and improve query performance. In order to address the different issues identified in this dissertation, we propose to employ a logical key management layer, a thin layer above the KV devices that maps logical keys into phsyical keys on the devices. We show how such a layer can enable multiple solutions to improve the performance and reliability of KVSSD based storage systems. First, we present KVRAID, a high performance, write efficient erasure coding management scheme on emerging key-value SSDs. The core innovation of KVRAID is to propose a logical key management layer that maps logical keys to physical keys to efficiently pack similar size KV objects and dynamically manage the membership of erasure coding groups. Unlike existing schemes which manage erasure codes on the block level, KVRAID manages the erasure codes on the KV object level. In order to achieve better storage efficiency for variable sized objects, KVRAID predefines multiple fixed sizes (slabs) according to the object size distribution for the erasure code. KVRAID uses a logical to physical key conversion to pack the KV objects of similar size into a parity group. KVRAID uses a lazy deletion mechanism with a garbage collector for object updates. Our experiments show that in 100% put case, KVRAID outperforms software block RAID by 18x in case of throughput and reduces 15x write amplification (WAF) with only ~5% CPU utilization. In a mixed update/get workloads, KVRAID achieves ~4x better throughput with ~23% CPU utilization and reduces the storage overhead and WAF by 3.6x and 11.3x in average respectively. Second, we present KVRangeDB, an ordered log structure tree based key index that supports range queries on a hash-based KVSSD. In addition, we propose to pack smaller application records into a larger physical record on the device through the logical key management layer. We compared the performance of KVRangeDB against RocksDB implementation on KVSSD and stateof- art software KV-store Wisckey on block device, on three types of real world applications of cloud-serving workloads, TABLEFS filesystem and time-series databases. For cloud serving applications, KVRangeDB achieves 8.3x and 1.7x better 99.9% write tail latency respectively compared to RocksDB implementation on KV-SSD and Wisckey on block SSD. On the query side, KVrangeDB only performs worse for those very long scans, but provides fast point queries and closed range queries. The experiments on TABLEFS demonstrate that using KVRangeDB for metadata indexing can boost the performance by a factor of ~6.3x in average and reduce ~3.9x CPU cost for four metadata-intensive workloads compared to RocksDB implementation on KVSSD. Compared toWisckey, KVRangeDB improves performance by ~2.6x in average and reduces ~1.7x CPU usage. Third, we propose a generic FPGA accelerator for emerging Minimum Storage Regenerating (MSR) codes encoding/decoding which maximizes the computation parallelism and minimizes the data movement between off-chip DRAM and the on-chip SRAM buffers. To demonstrate the efficiency of our proposed accelerator, we implemented the encoding/decoding algorithms for a specific MSR code called Zigzag code on Xilinx VCU1525 acceleration card. Our evaluation shows our proposed accelerator can achieve ~2.4-3.1x better throughput and ~4.2-5.7x better power efficiency compared to the state-of-art multi-core CPU implementation and ~2.8-3.3x better throughput and ~4.2-5.3x better power efficiency compared to a modern GPU accelerato

Texas A&M Repository