20 research outputs found

    Hera Object Storage : a seamless, automated multi-tiering solution on top of OpenStack Swift

    Over the last couple of decades, the demand for storage in the Cloud has grown exponentially. Distributed Cloud storage and object storage for the increasing share of unstructured data are a major focus of both academic and industrial research. At the same time, storage efficiency and the corresponding cost are often competing parameters, which poses a trade-off problem for any proposed solution. To address this, classifying data in terms of access probability has become a hot topic. This paper introduces Hera Object Storage, a storage system built on top of OpenStack Swift that aims to select the most appropriate storage tier for each object to be stored. The goal of the proposed multi-tiering storage is to be automated and seamless, guaranteeing the required storage performance at the lowest possible cost. The paper discusses the design challenges and the proposed algorithmic solutions and, based on a prototype implementation, presents a basic proof-of-concept validation.
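
The stated goal of the Hera abstract, meeting the required performance at the lowest possible cost, can be illustrated with a minimal sketch. This is not Hera's algorithm, which the abstract does not describe. The `Tier` class, the `select_tier` function, and all latency and cost figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    latency_ms: float   # typical read latency (hypothetical figure)
    cost_per_gb: float  # monthly cost per GB (hypothetical figure)


def select_tier(tiers, max_latency_ms):
    """Return the cheapest tier whose latency meets the requirement."""
    eligible = [t for t in tiers if t.latency_ms <= max_latency_ms]
    if not eligible:
        raise ValueError("no tier satisfies the latency requirement")
    return min(eligible, key=lambda t: t.cost_per_gb)


# Hypothetical three-tier setup, fastest and most expensive first.
TIERS = [
    Tier("ssd", 1.0, 0.10),
    Tier("hdd", 10.0, 0.03),
    Tier("archive", 5000.0, 0.004),
]
```

With these assumed figures, an object that tolerates 20 ms reads lands on the `hdd` tier, while one that needs 2 ms or better is placed on `ssd`. A real system would also weigh access probability, as the abstract notes.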

    Performance analysis of an iSCSI block device in virtualized environment

    Virtualization is new to telecom, but it has already been implemented in the IT sector. Its benefits are therefore already proven, which draws the attention of other sectors. Telecom organizations are now also focusing on virtualization to reap its full benefits. The main focus of this thesis is a performance analysis of a block storage device in a virtualized environment. Storage performance plays a vital role in the telecom sector: the performance and reliability of the storage device are important factors in fulfilling client requests with minimum latency. This thesis comprises three main parts. The first, a literature study, examines the different storage networking options and the storage protocols used to establish communication between the server and the storage in a storage area network (SAN). The study indicated that Internet Small Computer System Interface (iSCSI) has more advantages than the other options. The second part covers the design of the SAN solution. The storage is offered by an iSCSI storage server, which provides block-level storage device access to the compute server. Several iSCSI targets are available on the market, and their performance was compared. Linux-IO Target was found to be the better iSCSI target in terms of performance and reliability. The storage server was implemented as a virtual machine for better resource utilization; accordingly, hypervisors were studied and the different networking options for virtual machines were compared. The final part optimizes the SAN solution. Multipathing, different caching options, and different driver options provided by the Kernel-based Virtual Machine (KVM)/Quick Emulator (QEMU) were considered for optimization.

    Modeling Information Lifecycle Management


    Improving Data Management and Data Movement Efficiency in Hybrid Storage Systems

    University of Minnesota Ph.D. dissertation. July 2017. Major: Computer Science. Advisor: David Du. 1 computer file (PDF); ix, 116 pages. In the big data era, the large volumes of data being continuously generated drive the emergence of high-performance, large-capacity storage systems. To reduce the total cost of ownership, storage systems are built in a more composite way from many different types of emerging storage technologies and devices, including Storage Class Memory (SCM), Solid State Drives (SSD), Shingled Magnetic Recording (SMR) drives, Hard Disk Drives (HDD), and even off-premise cloud storage. To make better use of each type of storage, industry has provided multi-tier storage that dynamically places hot data in the faster tiers and cold data in the slower tiers. Data movement happens within a single device as well as between devices connected via various networks. Toward improving data management and data movement efficiency in such hybrid storage systems, this work makes the following contributions. To bridge the large semantic gap between applications and modern storage systems, passing small but useful pieces of information (I/O access hints) from upper layers to the block storage layer may greatly improve application performance or ease data management in heterogeneous storage systems. We design and develop a generic and flexible framework, called HintStor, to execute and evaluate various I/O access hints on heterogeneous storage systems with minor modifications to the kernel and applications. The design of HintStor contains a new application/user-level interface, a file system plugin, and a block storage data manager. With HintStor, storage systems composed of various storage devices can perform pre-devised data placement, space reallocation, and data migration policies assisted by the added access hints. 
Each storage device or technology has its own price-performance tradeoffs and idiosyncrasies with respect to the workload characteristics it prefers to support. To explore the internal access patterns and thus efficiently place data on storage systems with fully connected (i.e., data can move from one device to any other device instead of moving tier by tier) differential pools (each pool consists of storage devices of a particular type), we propose a chunk-level storage-aware workload analyzer framework, abbreviated as ChewAnalyzer. With ChewAnalyzer, the storage manager can adequately distribute and move the data chunks across different storage pools. To reduce the duplicate content transferred between local storage devices and devices in remote data centers, an inline Network Redundancy Elimination (NRE) process with a Content-Defined Chunking (CDC) policy can obtain a higher Redundancy Elimination (RE) ratio, but may suffer from a considerably higher computational requirement than fixed-size chunking. We build an inline NRE appliance which incorporates an improved FPGA-based scheme to speed up CDC processing. To efficiently utilize the hardware resources, the whole NRE process is handled by a Virtualized NRE (VNRE) controller. The uniqueness of the VNRE that we developed lies in its ability to exploit the redundancy patterns of different TCP flows and customize the chunking process to achieve a higher RE ratio.
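
The contrast between Content-Defined Chunking and fixed-size chunking drawn in the abstract above can be sketched in a few lines. This is a simplified software illustration using a gear-style rolling hash, not the dissertation's FPGA-based scheme. The function name `cdc_chunks`, the `GEAR` table, and all size parameters are assumptions made for this example.

```python
import random

_rng = random.Random(0)
# One fixed pseudo-random 64-bit value per possible byte (assumed table).
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data, min_size=64, avg_bits=10, max_size=4096):
    """Split bytes into chunks whose boundaries depend on content.

    A boundary is declared when the top `avg_bits` bits of the rolling
    hash are all zero, so a cut is expected roughly every 2**avg_bits
    bytes past `min_size`. `max_size` forces a cut if none occurs.
    """
    cut_mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left by one per byte means bytes older than 64
        # positions drop out of the 64-bit hash: a sliding window.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & cut_mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because each boundary depends only on the bytes just before it, inserting a byte near the start of a stream typically changes only the first chunk or two, and later chunks line up again. With fixed-size chunking the same insertion shifts every subsequent chunk, which is why CDC tends to find more redundancy at a higher computational cost.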

    Improving Storage with Stackable Extensions

    Storage is a central part of computing. Driven by an exponentially increasing content generation rate and a widening performance gap between memory and secondary storage, researchers are on a perennial quest to push for further innovation. This has resulted in novel ways to “squeeze” more capacity and performance out of current and emerging storage technology. Adding intelligence and leveraging new types of storage devices has opened the door to a whole new class of optimizations to save cost, improve performance, and reduce energy consumption. In this dissertation, we first develop, analyze, and evaluate three storage extensions. Our first extension tracks application access patterns and writes data in the way individual applications most commonly access it to benefit from the sequential throughput of disks. Our second extension uses a lower-power flash device as a cache to save energy and turn off the disk during idle periods. Our third extension is designed to leverage the characteristics of both disks and solid state devices by placing data in the most appropriate device to improve performance and save power. In developing these systems, we learned that extending the storage stack is a complex process. Implementing new ideas incurs a prolonged and cumbersome development process and requires developers to have advanced knowledge of the entire system to ensure that extensions accomplish their goal without compromising data recoverability. Furthermore, storage administrators are often reluctant to deploy specific storage extensions without understanding how they interact with other extensions and whether the extension ultimately achieves the intended goal. We address these challenges by using a combination of approaches. First, we simplify the storage extension development process with system-level infrastructure that implements core functionality commonly needed for storage extension development. 
Second, we develop a formal theory to help administrators deploy storage extensions while guaranteeing that the given high-level goals are satisfied. There are, however, some cases for which our theory is inconclusive. For such scenarios we present an experimental methodology that allows administrators to pick the extension that performs best for a given workload. Our evaluation demonstrates the benefits of both the infrastructure and the formal theory.

    Matching distributed file systems with application workloads

    Modern storage systems have a large number of configurable parameters, distributed over many layers of abstraction. The number of combinations of these parameters that can be altered to create an instance of such a system is enormous. In practice, many of these parameters are never altered; instead, default values, intended to support generic workloads and access patterns, are used. As systems become larger and evolve to support different workloads, the appropriateness of using default parameters in this way comes into question. This thesis examines the implications of changing some of these parameters and explores the effects these changes have on performance. As part of that work, multiple contributions have been made, including the creation of a structured method to create and evaluate different storage configurations, choosing appropriate access sizes for the evaluation, picking representative cloud workloads and capturing storage traces for further analysis, extraction of the workload storage characteristics, creating logical partitions of the distributed file system used for the optimization, the creation of heterogeneous storage pools within the homogeneous system, and the mapping and evaluation of the chosen workloads to the examined configurations.

    Improving Caches in Consolidated Environments

    Memory (cache, DRAM, and disk) is in charge of providing data and instructions to a computer’s processor. In order to maximize performance, the speeds of the memory and the processor should be equal. However, using memory that always matches the speed of the processor is prohibitively expensive. Computer hardware designers have managed to drastically lower the cost of the system with the use of memory caches by sacrificing some performance. A cache is a small piece of fast memory that stores popular data so it can be accessed faster. Modern computers have evolved into a hierarchy of caches, where a memory level is the cache for a larger and slower memory level immediately below it. Thus, by using caches, manufacturers are able to store terabytes of data at the cost of the cheapest memory while achieving speeds close to that of the fastest one. The most important decision about managing a cache is what data to store in it. Failing to make good decisions can lead to performance overheads and over-provisioning. Surprisingly, caches choose data to store based on policies that have not changed in principle for decades. However, computing paradigms have changed radically, leading to two noticeably different trends. First, caches are now consolidated across hundreds to even thousands of processes. Second, caching is being employed at new levels of the storage hierarchy due to the availability of high-performance flash-based persistent media. This brings four problems. First, as the number of workloads sharing a cache increases, it is more likely that they contain duplicated data. Second, consolidation creates contention for caches, and if not managed carefully, it translates to wasted space and sub-optimal performance. Third, as contended caches are shared by more workloads, administrators need to carefully estimate specific per-workload requirements across the entire memory hierarchy in order to meet per-workload performance goals. 
And finally, current cache write policies are unable to simultaneously provide performance and consistency guarantees for the new levels of the storage hierarchy. We addressed these problems by modeling their impact and by proposing solutions for each of them. First, we measured and modeled the amount of duplication at the buffer cache level and contention in real production systems. Second, we created a unified model of workload cache usage under contention to be used by administrators for provisioning, or by process schedulers to decide what processes to run together. Third, we proposed methods for removing cache duplication and eliminating the space wasted because of contention. And finally, we proposed a technique to improve the consistency guarantees of write-back caches while preserving their performance benefits.

    Scalability in extensible and heterogeneous storage systems

    The evolution of computer systems has brought an exponential growth in data volumes, which strains the ability of current storage architectures to organize and access this information effectively: as the unending creation of and demand for computer-generated data grows at an estimated rate of 40-60% per year, storage infrastructures need increasingly scalable data distribution layouts that are able to adapt to this growth with adequate performance. In order to provide the required performance and reliability, large-scale storage systems have traditionally relied on multiple RAID-5 or RAID-6 storage arrays, interconnected with high-speed networks like FibreChannel or SAS. Unfortunately, the performance of the most commonly used storage technology, the magnetic disk drive, cannot keep up with the rate needed to sustain this explosive growth. Moreover, storage architectures based on solid-state devices (the successors of current magnetic drives) do not seem poised to replace HDD-based storage for the next 5-10 years, at least in data centers. Though the performance of SSDs significantly improves on that of hard drives, it would cost the NAND industry hundreds of billions of dollars to build enough manufacturing plants to satisfy the forecasted demand. Besides the problems derived from technological and mechanical limitations, the massive data growth poses more challenges: to build a storage infrastructure, the most flexible approach consists in using pools of storage devices that can be expanded as needed by adding new devices or replacing older ones, thus seamlessly increasing the system's performance and capacity. This approach, however, needs data layouts that can adapt to these topology changes and also exploit the potential performance offered by the hardware. 
Such strategies should be able to rebuild the data layout to accommodate the new devices in the infrastructure, extracting the utmost performance from the hardware and offering a balanced workload distribution. An inadequate data layout might not effectively use the enlarged capacity or better performance provided by newer devices, thus leading to imbalance problems like bottlenecks or resource underuse. Besides, massive storage systems will inevitably be composed of a collection of heterogeneous hardware: as capacity and performance requirements grow, new storage devices must be added to cope with demand, but it is unlikely that these devices will have the same capacity or performance as those already installed. Moreover, upon failure, disks are most commonly replaced by faster and larger ones, since it is not always easy (or cheap) to find a particular model of drive. In the long run, any large-scale storage system will have to cope with a myriad of devices. The title of this dissertation, "Scalability in Extensible and Heterogeneous Storage Systems", refers to the main focus of our contributions: scalable data distributions that can adapt to increasing volumes of data. Our first contribution is the design of a scalable data layout that can adapt to hardware changes while redistributing only the minimum data to keep a balanced workload. With the second contribution, we perform a comparative study on the influence of pseudo-random number generators on the performance and distribution quality of randomized layouts and prove that a badly chosen generator can degrade the quality of the strategy. Our third contribution is an analysis of long-term data access patterns in several real-world traces to determine if it is possible to offer high performance and a balanced load with less than minimal data rebalancing. 
In our final contribution, we apply the knowledge learnt about long-term access patterns to design an extensible RAID architecture that can adapt to changes in the number of disks without migrating large amounts of data, and prove that it can be competitive with current RAID arrays with an overhead of at most 1.28% of the storage capacity.
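
The idea of a layout that adapts to new devices while moving only the minimum amount of data, described in the abstract above, can be illustrated with rendezvous (highest-random-weight) hashing. This is a well-known general technique used here as a stand-in; it is not the layout proposed in the dissertation, and it ignores the heterogeneous capacities the dissertation addresses. The function names and device labels are hypothetical.

```python
import hashlib


def _score(device, key):
    """Pseudo-random 64-bit score for a (device, key) pair."""
    digest = hashlib.sha256(f"{device}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(key, devices):
    """Place a key on the device with the highest score for it."""
    return max(devices, key=lambda d: _score(d, key))


def moved_keys(keys, old_devices, new_devices):
    """Keys whose placement differs between two device sets."""
    return [k for k in keys
            if place(k, old_devices) != place(k, new_devices)]


# Hypothetical example: grow a four-disk pool by one disk.
OLD = ["disk0", "disk1", "disk2", "disk3"]
NEW = OLD + ["disk4"]
KEYS = [f"block-{i}" for i in range(1000)]
MOVED = moved_keys(KEYS, OLD, NEW)
```

Since each key goes to its highest-scoring device, adding `disk4` relocates a key only when `disk4` outscores the previous winner, so every key in `MOVED` lands on the new disk and the rest stay put. On average about one key in five moves in this setup, which is the share a balanced five-disk pool requires. The quality of the balance depends on the hash behaving like a good pseudo-random source, the same concern the abstract raises about generator choice.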