6 research outputs found

    Assise: Performance and Availability via NVM Colocation in a Distributed File System

    The adoption of very low latency persistent memory modules (PMMs) upends the long-established model of disaggregated file system access. Instead, by colocating computation and PMM storage, we can provide applications much higher I/O performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol for managing a set of server-colocated PMMs as a fast, crash-recoverable cache between applications and slower disaggregated storage, such as SSDs. Unlike disaggregated file systems, Assise maximizes locality for all file IO by carrying out IO on colocated PMM whenever possible and minimizes coherence overhead by maintaining consistency at IO operation granularity, rather than at fixed block sizes. We compare Assise to Ceph/Bluestore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs for common cloud applications and benchmarks, such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency up to 22x, throughput up to 56x, fail-over time up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics. Assise promises to beat the MinuteSort world record by 1.5x

    File cache optimization in virtualized environments (Optimisation des caches de fichiers dans les environnements virtualisés)

    Demand for computing resources has risen sharply over several decades, whether for social networking, high-performance computing, or big data applications. Companies are therefore turning to solutions for outsourcing their IT services, such as cloud computing. Cloud computing pools computing resources in a datacenter and generally relies on virtualization, which splits a physical machine, called the host, into several guest virtual machines (VMs). Virtualization raises new challenges in operating system design, particularly for memory management. Memory is often used to speed up costly disk accesses by keeping or prefetching disk data in the file cache. However, memory is a limited and limiting resource in virtualized environments, which affects the performance of user applications. It is therefore necessary to optimize file cache usage in these environments. In this thesis, we propose two orthogonal approaches that improve application performance through better use of the file cache. In virtualized environments, the host and the guests each run their own operating system (OS) and therefore each have their own file cache. When a file is read, its data ends up in both caches, even though both OSes draw on the same physical memory. This is known as cache page duplication. The first contribution addresses this problem with Cacol, a cache eviction policy that runs in the host and is non-intrusive with respect to the VM. Cacol avoids these duplicate pages, thereby reducing the memory usage of a physical machine.
The second approach extends the VMs' file cache by exploiting memory available on other machines in the datacenter. This second contribution, called Infinicache, relies on Infiniband, a high-speed RDMA network, and exploits its ability to read and write remote memory. Implemented directly in the guest cache, Infinicache stores the pages evicted from its cache in remote memory. Future accesses to these pages are then faster than accesses to storage disks, which improves application performance. In addition, memory utilization across the datacenter increases, reducing waste overall

    Incast mitigation in a data center storage cluster through a dynamic fair-share buffer policy

    Incast is a phenomenon in which multiple devices communicate with a single device at the same time. Multiple storage senders overflow either the switch buffer or the memory of the single receiver. This pattern forces all concurrent senders to stop and wait for buffer or memory availability, and leads to packet loss and retransmission, resulting in very high latency. We present a software-defined technique that tackles this many-to-one communication pattern, Incast, in a data center storage cluster. Our proposed method decouples the default TCP windowing mechanism from all storage servers and delegates it to the software-defined storage controller. The proposed method removes the TCP saw-tooth behavior, provides global flow awareness, and implements a dynamic fair-share buffer policy for the end-to-end I/O path. It considers all I/O stages (applications, device drivers, NICs, switches/routers, file systems, I/O schedulers, main memory, and physical disks) while achieving the maximum I/O throughput. The policy, which is part of the proposed method, allocates a fair share of bandwidth to all storage servers. Priority queues are incorporated to handle the most important data flows. In addition, the proposed method provides better manageability and maintainability than traditional storage networks, where the data plane and control plane reside in the same device
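
The abstract above does not specify how the fair-share policy computes its allocations. As a minimal sketch, assuming the controller applies classic max-min fairness (water-filling) to a single shared buffer or link, the split could look like the following. The function name `max_min_fair_share`, the sender names, and the capacity and demand units are all hypothetical and are not taken from the paper.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` among senders by max-min fairness (water-filling).

    Assumption: this stands in for the paper's unspecified fair-share policy.
    `demands` maps a sender name to the amount it would like to send.
    """
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    # Senders whose demand has not been fully met yet.
    unsatisfied = {name for name, demand in demands.items() if demand > 0}

    while unsatisfied and remaining > 1e-9:
        # Offer every unsatisfied sender an equal slice of what is left.
        share = remaining / len(unsatisfied)
        for name in sorted(unsatisfied):
            grant = min(share, demands[name] - allocation[name])
            allocation[name] += grant
            remaining -= grant
        # Senders that are now fully served leave; leftovers are re-split.
        unsatisfied = {
            name for name in unsatisfied
            if demands[name] - allocation[name] > 1e-9
        }
    return allocation


if __name__ == "__main__":
    print(max_min_fair_share(100, {"a": 20, "b": 50, "c": 60}))
```

With a capacity of 100 and demands of 20, 50, and 60, the small sender is fully served and the other two receive roughly 40 each, so no sender can gain without taking from one that already has less.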

    ACCELERATING STORAGE APPLICATIONS WITH EMERGING KEY VALUE STORAGE DEVICES

    With the continuous data explosion of the big data era, traditional software and hardware stacks face unprecedented challenges in operating at such data scale. Designing new architectures and efficient systems for data-oriented applications has therefore become increasingly critical. This motivates us to rethink conventional storage system design and re-architect both software and hardware to meet the challenges of scale. Besides the fast growth of data volume, the increasing demand from storage applications such as video streaming and data analytics is pushing high-performance flash-based storage devices to replace traditional spinning disks. This all-flash era increases data reliability concerns because of the endurance problem of flash devices. Key-value stores (KVS) are an important storage infrastructure for handling fast-growing unstructured data and have been widely deployed in a variety of scale-out enterprise applications such as online retail, big data analytics, and social networks. How to efficiently manage data redundancy for key-value stores to provide data reliability, and how to efficiently support range queries to accelerate analytics-oriented applications under emerging key-value store system architectures, have become important research problems. In this research, we focus on how to design new software and hardware architectures for key-value store applications to provide reliability and improve query performance. To address the different issues identified in this dissertation, we propose a logical key management layer, a thin layer above the KV devices that maps logical keys to physical keys on the devices. We show how such a layer can enable multiple solutions that improve the performance and reliability of KVSSD-based storage systems. First, we present KVRAID, a high-performance, write-efficient erasure coding management scheme for emerging key-value SSDs. 
The core innovation of KVRAID is a logical key management layer that maps logical keys to physical keys in order to efficiently pack KV objects of similar size and dynamically manage the membership of erasure coding groups. Unlike existing schemes, which manage erasure codes at the block level, KVRAID manages them at the KV object level. To achieve better storage efficiency for variable-sized objects, KVRAID predefines multiple fixed sizes (slabs) for the erasure code according to the object size distribution. KVRAID uses a logical-to-physical key conversion to pack KV objects of similar size into a parity group, and a lazy deletion mechanism with a garbage collector for object updates. Our experiments show that in the 100% put case, KVRAID outperforms software block RAID by 18x in throughput and reduces write amplification (WAF) by 15x with only ~5% CPU utilization. In mixed update/get workloads, KVRAID achieves ~4x better throughput with ~23% CPU utilization and reduces the storage overhead and WAF by 3.6x and 11.3x on average, respectively. Second, we present KVRangeDB, an ordered log-structured tree-based key index that supports range queries on a hash-based KVSSD. In addition, we propose to pack smaller application records into a larger physical record on the device through the logical key management layer. We compared the performance of KVRangeDB against a RocksDB implementation on KVSSD and the state-of-the-art software KV store WiscKey on a block device, using three types of real-world applications: cloud-serving workloads, the TABLEFS filesystem, and time-series databases. For cloud-serving applications, KVRangeDB achieves 8.3x and 1.7x better 99.9th-percentile write tail latency compared to the RocksDB implementation on KVSSD and WiscKey on block SSD, respectively. On the query side, KVRangeDB performs worse only for very long scans, and provides fast point queries and closed range queries. 
The experiments on TABLEFS demonstrate that using KVRangeDB for metadata indexing can boost performance by ~6.3x on average and reduce CPU cost by ~3.9x for four metadata-intensive workloads compared to the RocksDB implementation on KVSSD. Compared to WiscKey, KVRangeDB improves performance by ~2.6x on average and reduces CPU usage by ~1.7x. Third, we propose a generic FPGA accelerator for encoding and decoding emerging Minimum Storage Regenerating (MSR) codes, which maximizes computation parallelism and minimizes data movement between off-chip DRAM and the on-chip SRAM buffers. To demonstrate the efficiency of our proposed accelerator, we implemented the encoding/decoding algorithms for a specific MSR code called Zigzag code on a Xilinx VCU1525 acceleration card. Our evaluation shows that the proposed accelerator achieves ~2.4-3.1x better throughput and ~4.2-5.7x better power efficiency than a state-of-the-art multi-core CPU implementation, and ~2.8-3.3x better throughput and ~4.2-5.3x better power efficiency than a modern GPU accelerator
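
The KVRAID description above (slab sizes, logical-to-physical key conversion, parity groups of similar-size objects) can be illustrated with a small sketch. This is not the dissertation's implementation: the slab sizes, the group width, the `SlabPacker` class, and the use of a single XOR parity in place of a real erasure code are all assumptions chosen to keep the example short.

```python
from bisect import bisect_left

SLAB_SIZES = [64, 256, 1024]  # hypothetical slab sizes in bytes
GROUP_WIDTH = 3               # hypothetical number of data objects per parity group


def slab_for(size):
    """Return the smallest predefined slab that can hold `size` bytes."""
    idx = bisect_left(SLAB_SIZES, size)
    if idx == len(SLAB_SIZES):
        raise ValueError("object larger than the largest slab")
    return SLAB_SIZES[idx]


def xor_parity(blocks):
    """XOR equally sized blocks together (stand-in for a real erasure code)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


class SlabPacker:
    """Maps logical keys to physical keys of the form (slab, group id, slot)."""

    def __init__(self):
        self.key_map = {}     # logical key -> physical key
        self.open_group = {}  # slab -> (group id, padded members so far)
        self.parity = {}      # (slab, group id) -> parity bytes
        self.next_group = 0

    def put(self, logical_key, value):
        slab = slab_for(len(value))
        if slab not in self.open_group:
            # Start a new parity group for this slab size.
            self.open_group[slab] = (self.next_group, [])
            self.next_group += 1
        group_id, members = self.open_group[slab]
        self.key_map[logical_key] = (slab, group_id, len(members))
        members.append(value.ljust(slab, b"\0"))  # pad up to the slab size
        if len(members) == GROUP_WIDTH:
            # Group is full: compute its parity and close it.
            self.parity[(slab, group_id)] = xor_parity(members)
            del self.open_group[slab]
        return self.key_map[logical_key]
```

Because every member of a group is padded to the same slab size, a lost member can be rebuilt in this sketch by XOR-ing the parity with the surviving members. Updates, lazy deletion, and garbage collection, which the abstract mentions, are left out.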