Search CORE

7 research outputs found

Prototyping a high-performance low-cost solid-state disk

Author
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011
Field of study

Analysis and optimization of storage IO in distributed and massive parallel high performance systems

Author: Sayed Mohamed Salem el-
Publication venue
Publication date: 01/01/2011
Field of study

Although Moore’s law ensures the increase in computational power, IO performance appears to be left behind. This minimizes the benefits gained from increased computational power. Processors have to idle for a long time waiting for IO. Another factor that slows the IO communication is the increased parallelism required in today’s computations. Most modern processing units are built from multiple weak cores. Since IO has a low parallelism the weak cores will decrease system performance. Furthermore to avoid added delay of external storage, future High Performance Computing (HPC) systems will employ Active Storage Fabrics (ASF). These embed storage directly into large HPC systems. Single HPC node IO performance will therefore require optimization. This can only be achieved with a full understanding of the IO stack operations. The analysis of the IO stack under the new conditions of multi-core and massive parallelism leads to some important conclusions. The IO stack is generally built for single devices and is heavily optimized for HDD. Two main optimization approaches are taken. The first is optimizing the IO stack to accommodate parallelism. Conclusions on IO analysis shows that a design based on several parallel operating storage devices is the best approach for parallelism in the IO stack. A parallel IO device with unified storage space is introduced. The unified storage space allows for optimal function division among resources for both read and write. The design also avoids large parallel file systems overhead by using limited changes to a conventional file system. Furthermore the interface of the IO stack is not changed by the design. This is a rather important restriction to avoid application rewrite. The implementation of such a design is shown to result in an increase in performance. The second approach is Optimizing the IO stack for Solid State Drives (SSD). The optimization for the new storage technology demanded further analysis. These show that the IO stack requires revision on many levels for optimal accommodation of SSD. File system preallocation of free blocks is used as an example. Preallocation is important for data contingency on HDD. However due to fast random access of SSD preallocation represents an overhead. By careful analysis to the block allocation algorithms, preallocation is removed. As an additional optimization approach IO compression is suggested for future work. It can utilize idle cores during an IO transaction to perform on the fly IO data compression

Exploiting intrinsic flash properties to enhance modern storage systems

Author: Huang Jian
Publication venue: Georgia Institute of Technology
Publication date: 20/08/2018
Field of study

The longstanding goals of storage system design have been to provide simple abstractions for applications to efficiently access data while ensuring the data durability and security on a hardware device. The traditional storage system, which was designed for slow hard disk drive with little parallelism, does not fit for the new storage technologies such as the faster flash memory with high internal parallelism. The gap between the storage system software and flash device causes both resource inefficiency and sub-optimal performance. This dissertation focuses on the rethinking of the storage system design for flash memory with a holistic approach from the system level to the device level and revisits several critical aspects of the storage system design including the storage performance, performance isolation, energy-efficiency, and data security. The traditional storage system lacks full performance isolation between applications sharing the device because it does not make the software aware of the underlying flash properties and constraints. This dissertation proposes FlashBlox, a storage virtualization system that utilizes flash parallelism to provide hardware isolation between applications by assigning them on dedicated chips. FlashBlox reduces the tail latency of storage operations dramatically compared with the existing software-based isolation techniques while achieving uniform lifetime for the flash device. As the underlying flash device latency is reduced significantly compared to the conventional hard disk drive, the storage software overhead has become the major bottleneck. This dissertation presents FlashMap, a holistic flash-based storage stack that combines memory, storage and device-level indirections into a unified layer. By combining these layers, FlashMap reduces critical-path latency for accessing data in the flash device and improves DRAM caching efficiency significantly for flash management. The traditional storage software incurs energy-intensive storage operations due to the need for maintaining data durability and security for personal data, which has become a significant challenge for resource-constrained devices such as mobiles and wearables. This dissertation proposes WearDrive, a fast and energy-efficient storage system for wearables. WearDrive treats the battery-backed DRAM as non-volatile memory to store personal data and trades the connected phone’s battery for the wearable’s by performing large and energy-intensive tasks on the phone while performing small and energy-efficient tasks locally using battery-backed DRAM. WearDrive improves wearable’s battery life significantly with negligible impact to the phone’s battery life. The storage software which has been developed for decades is still vulnerable to malware attacks. For example, the encryption ransomware which is a malicious software that stealthily encrypts user files and demands a ransom to provide access to these files. Prior solutions such as ransomware detection and data backups have been proposed to defend against encryption ransomware. Unfortunately, by the time the ransomware is detected, some files already undergo encryption and the user is still required to pay a ransom to access those files. Furthermore, ransomware variants can obtain kernel privilege to terminate or destroy these software-based defense systems. This dissertation presents FlashGuard, a ransomware-tolerant SSD which has a firmware-level recovery system that allows effective data recovery from encryption ransomware. FlashGuard leverages the intrinsic flash properties to defend against the encryption ransomware and adds minimal overhead to regular storage operations.Ph.D

Scholarly Materials And Research @ Georgia Tech

An Introduction to Hyperdex and the Brave New World of High Performance, Scalable, Consistent, Faulttolerant Data Stores

Author: Bernard Wong
Emin
Gün Sirer
Robert Escriva
Publication venue
Publication date
Field of study

CiteSeerX

Operating System Support for High-Performance Solid State Drives

Author: Bjørling Matias
Publication venue: IT-Universitetet i København
Publication date: 01/01/2016
Field of study

The IT University of Copenhagen's Repository

HMC-Based Accelerator Design For Compressed Deep Neural Networks

Author: Min Chuhan
Publication venue
Publication date: 29/01/2020
Field of study

Deep Neural Networks (DNNs) offer remarkable performance of classifications and regressions in many high dimensional problems and have been widely utilized in real-word cognitive applications. In DNN applications, high computational cost of DNNs greatly hinder their deployment in resource-constrained applications, real-time systems and edge computing platforms. Moreover, energy consumption and performance cost of moving data between memory hierarchy and computational units are higher than that of the computation itself. To overcome the memory bottleneck, data locality and temporal data reuse are improved in accelerator design. In an attempt to further improve data locality, memory manufacturers have invented 3D-stacked memory where multiple layers of memory arrays are stacked on top of each other. Inherited from the concept of Process-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that can integrate general-purpose computational logic directly within main memory to take advantages of high internal bandwidth during computation. In this dissertation, we are going to investigate hardware/software co-design for neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload the primitives and functions to PIM logic layer through a latency-aware scheduling controller. In our compression framework, we formulate filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over filters in a convolutional layer, i.e., distribution of entropy conditioned on network loss. Based on the definition, different pruning efficiencies of global and layer-wise pruning strategies are compared, and two-phase pruning method is proposed. The proposed pruning method can achieve a reduction of 88% filters and 46% inference time reduction on VGG16 within 2% accuracy degradation. In this dissertation, we are going to investigate hardware/software co-design for neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compres- sion and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload the primitives and functions to PIM logic layer through a latency-aware scheduling con- troller. In our compression framework, we formulate filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over filters in a convolutional layer, i.e., distribution of entropy conditioned on network loss. Based on the definition, different pruning efficiencies of global and layer-wise pruning strategies are compared, and two-phase pruning method is proposed. The proposed pruning method can achieve a reduction of 88% filters and 46% inference time reduction on VGG16 within 2% accuracy degradation

D-Scholarship@Pitt