
    Scalable Storage for Digital Libraries

    I propose a storage system optimised for digital libraries. Its key features are its heterogeneous scalability; its integration and exploitation of rich semantic metadata associated with digital objects; its use of a name space; and its aggressive performance optimisation for the digital library domain.

    Research on Improving Reliability, Energy Efficiency and Scalability in Distributed and Parallel File Systems

    With the increasing popularity of cloud computing and Big Data applications, current data centers are often required to manage petabytes or exabytes of data. Storing this huge amount of data requires thousands or tens of thousands of storage nodes at a single site, which imposes three major challenges on storage system designers: (1) Reliability: node failure in these data centers is a normal occurrence rather than a rare situation, which makes data reliability a great concern. (2) Energy efficiency: a data center can consume up to 100 times more energy than a standard office building, and more than 10% of this consumption can be attributed to storage systems, so reducing the energy consumption of the storage system is key to reducing the overall consumption of the data center. (3) Scalability: with the continuously increasing size of data, maintaining the scalability of the storage system is essential; that is, the expansion of the storage system should be completed efficiently and without limits on the total number of storage nodes or on performance. This thesis proposes three ways to improve these three key features of current large-scale storage systems. Firstly, we define the problem of reverse lookup, namely finding the list of objects (blocks) stored on a failed node. As the first step of failure recovery, this process directly affects the recovery/reconstruction time. Existing solutions use metadata traversal or reverse data-distribution methods for reverse lookup, which are either time consuming or expensive, whereas a deterministic block placement can achieve fast and efficient reverse lookup. However, existing deterministic placement solutions are designed for centralized, small-scale storage architectures such as RAID; because they lack scalability, they cannot be directly applied to large-scale storage systems. We propose Group-Shifted Declustering (G-SD), a deterministic data layout for multi-way replication. G-SD addresses the scalability issue of our previous Shifted Declustering layout and supports fast and efficient reverse lookup. Secondly, we define the problem of how to balance performance, energy, and recovery in degradation mode for an energy-efficient storage system. While extensive research has been devoted to trading off performance for energy efficiency in normal mode, the system enters degradation mode when a node fails and reconstruction is initiated. Reconstruction requires a number of disks to be spun up and a substantial amount of I/O bandwidth, which compromises both energy efficiency and performance. We find that, because they do not consider the I/O bandwidth contention between recovery and foreground performance, current energy-proportional solutions cannot answer this question accurately. This thesis presents PERP, a mathematical model that minimizes the energy consumption of a storage system with respect to performance and recovery; PERP answers this problem by providing the number of active nodes and the recovery bandwidth to assign at each time frame. Thirdly, current distributed file systems such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS) employ a pseudo-random method for replica distribution and a centralized lookup table (block map) to record all replica locations. This lookup table requires a large amount of memory and consumes a considerable amount of CPU and network resources on the metadata server. With the booming size of Big Data, the metadata server becomes a scalability and performance bottleneck. While current approaches such as HDFS Federation attempt to extend scalability horizontally by allowing multiple metadata servers, we believe a more promising option is to scale up each metadata server vertically. We propose Deister, a novel block management scheme built on top of a deterministic declustering distribution method, Intersected Shifted Declustering (ISD), so that both replica distribution and location lookup can be achieved without a centralized lookup table.
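
    For intuition, the sketch below shows how a purely formula-based replica layout turns reverse lookup into a computation rather than a metadata scan or a centralized block map query. It is not the thesis's G-SD or ISD layout; the node count, the shift formula, and all names are illustrative assumptions.

```python
# Minimal sketch of deterministic placement enabling reverse lookup.
# The placement formula below is illustrative, not the G-SD/ISD layout.

NUM_NODES = 13   # prime, so the shifted replicas below always land on distinct nodes
REPLICAS = 3     # replication factor

def place(block_id: int) -> list[int]:
    """Return the nodes holding the replicas of a block, purely by arithmetic."""
    base = block_id % NUM_NODES
    shift = (block_id // NUM_NODES) % (NUM_NODES - 1) + 1   # nonzero shift
    return [(base + r * shift) % NUM_NODES for r in range(REPLICAS)]

def reverse_lookup(failed_node: int, total_blocks: int) -> list[int]:
    """List blocks whose replica set includes the failed node.

    Because placement is a pure function of block_id, recovery can recompute
    the affected blocks instead of traversing metadata or a lookup table.
    """
    return [b for b in range(total_blocks) if failed_node in place(b)]

if __name__ == "__main__":
    affected = reverse_lookup(failed_node=5, total_blocks=1000)
    print(f"{len(affected)} blocks need re-replication, e.g.", affected[:10])
```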

    A Fully Userspace Remote Storage Access Stack

    As computer networking has evolved and the available throughput has increased, the efficiency of the network software stack has become increasingly important. This is because the latency introduced by software has gone from insignificant, compared to historically poor network performance, to the largest component of latency for a modern local-area network. Currently, the vast majority of code that accesses the hardware is part of the kernel, because the kernel is responsible for ensuring that user applications do not interfere with each other when accessing the hardware. Remote Direct Memory Access~(RDMA) provides a solution for applications to perform direct data transfers over the network without requiring context switches into the kernel, but relies instead on specialized hardware interfaces to handle the virtual address mappings and transport protocols. This more intelligent hardware allows for direct control from the userspace application, eliminating the cost of context switches into the kernel. This in turn reduces the overall latency of message transfers. Just like networking, storage is currently undergoing a similar evolution. For most of the recent history of computing, the most common durable storage mechanism has been mechanical hard disk drives, which can only be accessed at block level and have high latency compared to the software drivers used to access the data. However, the introduction of solid state disks~(SSDs) based on Flash significantly decreased the latency, as there are no mechanical parts that need to move to access the data. Upcoming non-volatile memory solutions reduce this latency even further, and even allow byte-level access to the storage medium. Thus, just like with networking, software drivers become the bottleneck and we look for solutions to bypass the kernel to improve the efficiency of direct userspace access to storage. This thesis offers two contributions as part of a solution to these problems. The first part introduces urdma, a software RDMA driver which leverages the Data Plane Development Kit (DPDK) to perform network data transfers in userspace without specialized RDMA interface hardware. The second part examines remote locking protocols, which are required for synchronization in distributed storage systems. We define an RDMA locking mechanism referred to as Verbs Offload Locking Technology (VOLT), which allows acquisition of a remote lock object without any CPU usage by the target node. This offloading allows VOLT to be used with disaggregated memory servers that have limited onboard CPU resources, while also lowering the application overhead for remote locking. Finally, we define a bytecode framework using enhanced Berkeley Packet Filter (eBPF) bytecode for extending the capabilities of an RDMA-capable network interface card (NIC) with new operations, and show how this can be used to implement our remote locking operation
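
    To make the offloaded-locking idea concrete, the sketch below models lock acquisition built on a remote atomic compare-and-swap, the kind of operation an RDMA NIC can execute on the target's memory without involving the target CPU (RDMA verbs expose such an atomic CAS operation). This is not the VOLT protocol itself; the class names, lock-word layout, and retry policy are assumptions for illustration only.

```python
# Illustrative model of CAS-based remote locking; not the thesis's VOLT design.
import threading
import time

class RemoteLockWord:
    """Stands in for a 64-bit lock word in a target node's registered memory."""
    UNLOCKED = 0

    def __init__(self) -> None:
        self._value = self.UNLOCKED
        self._mutex = threading.Lock()   # models the NIC's atomicity guarantee

    def compare_and_swap(self, expected: int, desired: int) -> int:
        """Atomically install `desired` if the word equals `expected`.

        Returns the value observed before the operation, as a remote CAS does.
        """
        with self._mutex:
            observed = self._value
            if observed == expected:
                self._value = desired
            return observed

def acquire(lock: RemoteLockWord, owner_id: int, retries: int = 100) -> bool:
    """Spin on remote CAS until the lock word flips from UNLOCKED to owner_id."""
    for _ in range(retries):
        if lock.compare_and_swap(RemoteLockWord.UNLOCKED, owner_id) == RemoteLockWord.UNLOCKED:
            return True
        time.sleep(0.001)                # back off before reissuing the CAS
    return False

def release(lock: RemoteLockWord, owner_id: int) -> None:
    """Releasing is another CAS: owner_id back to UNLOCKED."""
    lock.compare_and_swap(owner_id, RemoteLockWord.UNLOCKED)
```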

    A MIDDLE-WARE LEVEL CLIENT CACHE FOR A HIGH PERFORMANCE COMPUTING I/O SIMULATOR

    This thesis describes the design and run-time analysis of the system-level middleware cache for Hecios. Hecios is a high-performance cluster I/O simulator. With Hecios, we provide a simulation environment that accurately captures the performance characteristics of all the components in a cluster-wide parallel file system. Hecios was specifically modeled after PVFS2. It was designed to be extensible and to allow various component modules to be easily replaced by those that model other system types. Built around the OMNeT++ simulation package, Hecios' inner-cluster communication module is easily adaptable to any TCP/IP-based protocol and all standard network interface cards, switches, hubs, and routers. We examine the system cache component and describe a methodology for implementing other coherence and replacement techniques within Hecios. Like other cache simulation tools, we allow the size of the system cache to be varied independently of the replacement policy and caching technique used.
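
    As a rough illustration of how cache size and replacement policy can be varied independently, the sketch below separates the cached data from a pluggable policy object. The interfaces and names are assumptions for illustration, not Hecios' actual modules.

```python
# Minimal sketch of a client cache with a pluggable replacement policy.
from collections import OrderedDict

class LRUPolicy:
    """Replacement bookkeeping kept separate from the cached data."""
    def __init__(self) -> None:
        self._order = OrderedDict()

    def touch(self, block: int) -> None:
        self._order.pop(block, None)
        self._order[block] = True                     # most recently used goes last

    def evict(self) -> int:
        victim, _ = self._order.popitem(last=False)   # least recently used
        return victim

class ClientCache:
    def __init__(self, capacity_blocks: int, policy) -> None:
        self.capacity = capacity_blocks
        self.policy = policy
        self.data = {}                                # block id -> cached bytes

    def read(self, block: int, fetch) -> bytes:
        if block in self.data:                        # hit: update recency only
            self.policy.touch(block)
            return self.data[block]
        if len(self.data) >= self.capacity:           # miss on a full cache: evict
            del self.data[self.policy.evict()]
        self.data[block] = fetch(block)               # miss: fetch from the file system
        self.policy.touch(block)
        return self.data[block]

# Swapping in a different policy (FIFO, LFU, ...) only requires a new policy class,
# and the capacity can be varied without touching the policy.
cache = ClientCache(capacity_blocks=64, policy=LRUPolicy())
```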

    The expandable network disk

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 91-96). This thesis presents a virtual disk cluster called END, the Expandable Network Disk. END aggregates storage on a cluster of servers into a single virtual disk. END's main goals are to offer good performance during normal operation and to handle changes in the cluster membership efficiently. END achieves these goals using a two-layer design, in which storage "bricks" (servers that consist of CPU, memory, and hard disks) hold two kinds of information. The lower layer stores replicated, immutable chunks of data, each indexed by a unique key. The upper layer maps each block address to the key of its current data chunk; each mapping is held on two bricks using primary-backup replication. This separation gives END flexibility in where it stores chunks, and thus efficiency: it writes new chunks to bricks chosen for speed; it moves only address mappings (not data) when bricks fail and recover, which results in fast recovery; it fully replicates new writes during temporary brick failures; and it uses chunks on a recovered brick without risk of staleness. The END prototype's write throughput on a cluster of 16 PC-based bricks is 150 megabytes/s with 2x replication. END continues after a single brick failure, re-incorporates a rebooting brick, and expands to include a new brick, with a few seconds of reduced performance during each change. The results show that END's two-layer design maintains good performance, resumes operation quickly after changes in the cluster, and maintains full replication of new writes even during a brick failure. By Athicha Muthitacharoen. Ph.D.
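
    The sketch below illustrates the two-layer split described above: an upper layer that maps block addresses to chunk keys, and a lower layer that stores immutable chunks indexed by key. Replication, brick membership, and recovery are omitted; the class names and the use of a content hash as the key are illustrative assumptions, not END's exact design.

```python
# Minimal sketch of a two-layer virtual disk: address map over immutable chunks.
import hashlib

class ChunkStore:
    """Lower layer: immutable chunks keyed by the hash of their contents."""
    def __init__(self) -> None:
        self._chunks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self._chunks[key] = data             # immutable: same key, same data
        return key

    def get(self, key: str) -> bytes:
        return self._chunks[key]

class VirtualDisk:
    """Upper layer: mutable map from block address to the current chunk key."""
    def __init__(self, chunks: ChunkStore) -> None:
        self._chunks = chunks
        self._addr_to_key = {}

    def write(self, addr: int, data: bytes) -> None:
        # A write stores a new chunk and only repoints the address mapping;
        # relocating or recovering mappings never requires moving chunk data.
        self._addr_to_key[addr] = self._chunks.put(data)

    def read(self, addr: int) -> bytes:
        return self._chunks.get(self._addr_to_key[addr])

disk = VirtualDisk(ChunkStore())
disk.write(0, b"hello")
assert disk.read(0) == b"hello"
```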

    Improving Storage with Stackable Extensions

    Storage is a central part of computing. Driven by an exponentially increasing content generation rate and a widening performance gap between memory and secondary storage, researchers are in a perennial quest for further innovation. This has resulted in novel ways to “squeeze” more capacity and performance out of current and emerging storage technology. Adding intelligence and leveraging new types of storage devices has opened the door to a whole new class of optimizations to save cost, improve performance, and reduce energy consumption. In this dissertation, we first develop, analyze, and evaluate three storage extensions. Our first extension tracks application access patterns and writes data in the way individual applications most commonly access it, to benefit from the sequential throughput of disks. Our second extension uses a lower-power flash device as a cache to save energy and turn off the disk during idle periods. Our third extension is designed to leverage the characteristics of both disks and solid-state devices by placing data on the most appropriate device to improve performance and save power. In developing these systems, we learned that extending the storage stack is a complex process. Implementing new ideas incurs a prolonged and cumbersome development process and requires developers to have advanced knowledge of the entire system to ensure that extensions accomplish their goal without compromising data recoverability. Furthermore, storage administrators are often reluctant to deploy specific storage extensions without understanding how they interact with other extensions and whether the extension ultimately achieves the intended goal. We address these challenges with a combination of approaches. First, we simplify the storage extension development process with system-level infrastructure that implements core functionality commonly needed for storage extension development. Second, we develop a formal theory to assist administrators in deploying storage extensions while guaranteeing that the given high-level goals are satisfied. There are, however, some cases for which our theory is inconclusive. For such scenarios we present an experimental methodology that allows administrators to pick the extension that performs best for a given workload. Our evaluation demonstrates the benefits of both the infrastructure and the formal theory.
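
    As a rough picture of how such extensions can stack, the sketch below gives each extension the same read/write interface and has it wrap a lower layer, so extensions compose without knowing about one another. The interface and class names are assumptions for illustration, not the dissertation's actual infrastructure.

```python
# Minimal sketch of stackable storage extensions sharing one interface.
class StorageLayer:
    def read(self, block: int) -> bytes: ...
    def write(self, block: int, data: bytes) -> None: ...

class RamDisk(StorageLayer):
    """Bottom of the stack: a trivial in-memory block store."""
    def __init__(self) -> None:
        self._blocks = {}
    def read(self, block: int) -> bytes:
        return self._blocks.get(block, b"\0" * 4096)
    def write(self, block: int, data: bytes) -> None:
        self._blocks[block] = data

class AccessTracker(StorageLayer):
    """Extension: record per-block access counts, then pass the request down."""
    def __init__(self, lower: StorageLayer) -> None:
        self.lower = lower
        self.reads = {}
    def read(self, block: int) -> bytes:
        self.reads[block] = self.reads.get(block, 0) + 1
        return self.lower.read(block)
    def write(self, block: int, data: bytes) -> None:
        self.lower.write(block, data)

# Extensions stack by construction; another layer (e.g. a flash cache) would
# wrap this stack in the same way.
stack = AccessTracker(RamDisk())
stack.write(7, b"data")
stack.read(7)
```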

    Parallel replication for distributed video-on-demand systems.

    Lie, Wai-Kwok Peter. Thesis (M.Phil.)--Chinese University of Hong Kong, 1997. Includes bibliographical references (leaves 79-83). Contents: Chapter 1, Introduction. Chapter 2, Background & Related Work: early work on multimedia servers; compression of multimedia data; multimedia file systems; scheduling support for multimedia systems; inter-media synchronization; related work on replication in VOD systems. Chapter 3, System Model. Chapter 4, Replication Methodology: replication triggering policy; source and target node selection policies; six replication policies (injected sequential, piggybacked sequential, injected parallel, piggybacked parallel, injected & piggybacked parallel, and multi-source injected & piggybacked parallel); dereplication policy. Chapter 5, Distributed Architecture for VOD Server: server node; movie manager; metadata manager; protocols for servicing new and existing customers, for single/multi-source injected & parallel replication, and for dereplication; handling of server node and movie manager failures. Chapter 6, Results: performance metric; simulation environment; experiments without dereplication (comparison of replication policies, effect of early acceptance/migration, the resource consumption tradeoff, effect of movie popularity skewness, effect of the replication threshold, comparison of target node selection policies); overall impact of dynamic replication. Chapter 7, Comparison with BSR-based Policy. Chapter 8, Conclusions: summary and future research directions.

    New techniques to model energy-aware I/O architectures based on SSD and hard disk drives

    For years, performance improvements in the computer I/O subsystem and in other subsystems have advanced at their own pace, with the I/O subsystem improving the least, so that overall system speed has come to depend on the speed of the I/O subsystem. One of the main factors behind this imbalance is the inherent nature of disk drives, which has allowed great advances in disk density but far fewer in disk performance. Thus, to improve I/O subsystem performance, disk drives have become an object of study for many researchers, who in some cases must resort to different kinds of models. Other research aims to improve I/O subsystem performance by tuning more abstract I/O levels; since disk drives lie behind those levels, either real disk drives or models of them must be used. One of the most common techniques for evaluating the performance of a computer I/O subsystem relies on detailed simulation models that include specific features of storage devices such as disk geometry, zone splitting, caching, read-ahead buffers, and request reordering. However, as soon as a new technological innovation appears, those models need to be reworked to include its characteristics, making it difficult to keep general models up to date. Our alternative is to model a storage device as a black-box probabilistic model, in which the storage device itself, its interface, and the interconnection mechanisms are modeled as a single stochastic process, defining the service time as a random variable with an unknown distribution. This approach generates disk service times with less computational power, by means of a variate generator included in a simulator, and thus allows greater scalability in simulation-based evaluations of I/O subsystem performance. Lately, energy saving in computing systems has become an important need. In mobile computers, battery life is limited to a certain amount of time, and not wasting energy in certain components would extend the usable time of the computer. Here, again, the computer I/O subsystem stands out as a field of study, because disk drives, a main part of it, are among the most power-consuming elements due to their mechanical nature. In server or enterprise computers, where the number of disks increases considerably, power saving may reduce the cooling required for heat dissipation and, thus, large monetary costs. This dissertation also considers the question of saving energy in the disk drive by taking advantage of the diverse devices in hybrid storage systems composed of Solid State Disks (SSDs) and hard disk drives. SSDs and disk drives have different power characteristics, with SSDs consuming much less power than disk drives. In this thesis, several techniques that use SSDs as supporting devices for disk drives are proposed. Various options for managing SSDs and disk devices in such hybrid systems are examined, and it is shown that the proposed methods save energy and monetary costs in diverse scenarios. A simulator composed of disk and SSD devices was implemented. This thesis studies the design and evaluation of the proposed approaches with the help of realistic workloads.
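
    The sketch below illustrates the black-box idea: treat measured service times of a device as samples of an unknown distribution and replay new variates from it inside a simulator, instead of modelling geometry, zones, and caches explicitly. The use of the empirical distribution (bootstrap resampling) is an illustrative choice, not the thesis's exact fitting method, and the trace values are made up.

```python
# Minimal sketch of a black-box probabilistic storage device model.
import random

class BlackBoxDevice:
    def __init__(self, measured_service_times_ms: list[float]) -> None:
        self._samples = sorted(measured_service_times_ms)   # empirical distribution

    def service_time(self) -> float:
        """Draw a service time by inverse-transform sampling of the empirical CDF."""
        u = random.random()
        i = min(int(u * len(self._samples)), len(self._samples) - 1)
        return self._samples[i]

# Usage: feed a simulator with variates that are cheap to generate.
trace = [4.2, 5.1, 3.8, 12.7, 4.9, 6.3, 5.5, 4.4]   # ms, hypothetical device trace
disk = BlackBoxDevice(trace)
simulated_request_times = [disk.service_time() for _ in range(5)]
print(simulated_request_times)
```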