144 research outputs found

    A reliability model for dependent and distributed MDS disk array units

    Archiving and systematic backup of large digital data create a rapidly growing demand for multi-petabyte-scale storage systems. As drive capacities grow beyond a few terabytes to address the demands of today's cloud, multiple simultaneous disk failures have become a reality. Among the main factors behind catastrophic system failures, correlated disk failures and network bandwidth are reported to be the two most common sources of performance degradation. The emerging trend is to use sophisticated erasure codes (EC) with multiple parities and efficient repairs to meet reliability and bandwidth requirements. It is known that the mean time to failure and repair rates reported by disk manufacturers cannot capture the life-cycle patterns of distributed storage systems. In this study, we develop failure models based on generalized Markov chains that accurately capture correlated performance degradation under multi-parity protection schemes based on modern maximum distance separable (MDS) EC. Furthermore, we use the proposed model in a distributed storage scenario to quantify two example use cases: first, the common-sense observation that adding more parity disks is only meaningful if there is sufficient decorrelation between the failure domains of the storage system; and second, the reliability of generic multiple single-dimensional EC-protected storage systems.
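    As a rough illustration of the kind of Markov-chain reliability model the abstract describes, the sketch below computes the mean time to data loss (MTTDL) of a single (k + m) MDS-coded disk group from a small continuous-time Markov chain. The rates, the simple correlation multiplier, and the function names are illustrative assumptions, not the model proposed in the paper.

```python
import numpy as np

# Minimal sketch (not the paper's model): MTTDL of one (k + m) MDS-coded
# group via a birth-death CTMC over the number of failed disks.
# State i = i failed disks (0..m tolerable); state m+1 = data loss (absorbing).
# 'corr' > 1 crudely inflates failure rates once a failure exists, to mimic
# correlated failures; all parameter values below are illustrative assumptions.

def mttdl(k, m, lam, mu, corr=1.0):
    n = k + m                       # disks per group
    states = m + 1                  # transient states 0..m
    Q = np.zeros((states, states))  # sub-generator over transient states
    for i in range(states):
        boost = corr if i > 0 else 1.0
        fail = (n - i) * lam * boost   # rate of the next disk failure in state i
        rep = i * mu                   # parallel repair of i failed disks
        if i + 1 < states:
            Q[i, i + 1] = fail
        if i > 0:
            Q[i, i - 1] = rep
        Q[i, i] = -(fail + rep)        # the data-loss transition leaves the block
    # Expected time to absorption T solves Q @ T = -1 (vector of ones).
    T = np.linalg.solve(Q, -np.ones(states))
    return T[0]                        # MTTDL starting from the all-healthy state

if __name__ == "__main__":
    lam = 1 / 1.2e6   # per-disk failure rate (1/hours), illustrative
    mu = 1 / 24.0     # repair rate (1/hours), illustrative
    for m in (1, 2, 3):
        print(m, f"{mttdl(k=10, m=m, lam=lam, mu=mu, corr=3.0):.3e} h")
```

    Raising the correlation multiplier shrinks the reliability gain of each additional parity disk, which is the qualitative effect the first use case in the abstract refers to.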

    Feasibility study for a numerical aerodynamic simulation facility. Volume 1

    A Numerical Aerodynamic Simulation Facility (NASF) was designed for the simulation of fluid flow around three-dimensional bodies, both in wind tunnel environments and in free space. The application of numerical simulation to this field promised to yield economies in aerodynamic and aircraft body design. A model for a NASF/FMP (Flow Model Processor) ensemble, using a possible approach to meeting the NASF goals, is presented. The computer hardware and software are described, along with the complete design and a performance analysis and evaluation.

    Extending Scojo-PECT by migration based on application level checkpointing

    In parallel computing, jobs differ in their runtimes and required computation resources. Because runtimes are correlated with resource requirements, scheduling these jobs is a packing problem in which utilization and total execution time vary. Sometimes resources sit idle while jobs are preempted or blocked by resource conflicts and cannot use them, which wastes system resources to a significant degree. Here we propose an approach that takes periodic checkpoints of running jobs so that the scheduler can exploit migration to optimize long-term scheduling. We extend our original Scojo-PECT preemptive scheduler, which previously had no checkpoint support, and evaluate the execution time gained, minus the overhead of checkpointing and migration, against the original execution time.
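    To make the mechanism concrete, here is a minimal sketch of application-level checkpointing of the kind such a scheduler relies on: the job periodically serializes its own progress so it can be stopped and resumed on another node. The file name, checkpoint interval, and state layout are hypothetical and not taken from Scojo-PECT.

```python
import json
import os

# Minimal sketch of application-level checkpointing (not Scojo-PECT itself):
# the job periodically writes its own progress to a file, so a scheduler can
# preempt or migrate it and restart it from the last checkpoint.
# The file name and state layout are illustrative assumptions.

CKPT = "job_state.json"
CHECKPOINT_EVERY = 1000   # iterations between checkpoints (illustrative)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"iteration": 0, "partial_sum": 0.0}

def save_checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:     # write-then-rename keeps the checkpoint consistent
        json.dump(state, f)
    os.replace(tmp, CKPT)

def run(total_iterations=10_000):
    state = load_checkpoint()     # resume here after preemption or migration
    for i in range(state["iteration"], total_iterations):
        state["partial_sum"] += i * 0.5        # stand-in for real computation
        state["iteration"] = i + 1
        if state["iteration"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)
    save_checkpoint(state)
    return state["partial_sum"]

if __name__ == "__main__":
    print(run())
```

    Because the checkpoint file is ordinary data, migrating the job reduces to copying the file to the target node and restarting the program there; the cost of that copy plus the lost work since the last checkpoint is the overhead the abstract sets against the gained execution time.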

    Research on Improving Reliability, Energy Efficiency and Scalability in Distributed and Parallel File Systems

    With the increasing popularity of cloud computing and Big Data applications, current data centers are often required to manage petabytes or exabytes of data. To store this huge amount of data, thousands or tens of thousands of storage nodes are required at a single site. This imposes three major challenges on storage system designers: (1) Reliability: node failure in these data centers is a normal occurrence rather than a rare situation, which makes data reliability a great concern. (2) Energy efficiency: a data center can consume up to 100 times more energy than a standard office building, and more than 10% of this consumption can be attributed to storage systems, so reducing the energy consumption of the storage system is key to reducing the overall consumption of the data center. (3) Scalability: with the continuously increasing size of data, maintaining the scalability of the storage system is essential; that is, expansion of the storage system should be completed efficiently and without limits on the total number of storage nodes or on performance. This thesis proposes three ways to improve these three key features of current large-scale storage systems. First, we define the problem of reverse lookup, namely finding the list of objects (blocks) stored on a failed node. As the first step of failure recovery, this process directly determines recovery/reconstruction time. Existing solutions use metadata traversal or data-distribution reversal for reverse lookup, which are either time consuming or expensive, whereas a deterministic block placement can achieve fast and efficient reverse lookup. However, existing deterministic placement solutions are designed for centralized, small-scale storage architectures such as RAID and, lacking scalability, cannot be directly applied to large-scale storage systems. We propose Group-Shifted Declustering (G-SD), a deterministic data layout for multi-way replication. G-SD addresses the scalability issue of our previous Shifted Declustering layout and supports fast and efficient reverse lookup. Second, we ask how to balance performance, energy, and recovery in degradation mode for an energy-efficient storage system. While extensive research has traded performance for energy efficiency in normal mode, the system enters degradation mode when a node fails and reconstruction is initiated. Reconstruction requires a number of disks to be spun up and a substantial amount of I/O bandwidth, which compromises both energy efficiency and performance. Without considering the I/O bandwidth contention between recovery and foreground workload, current energy-proportional solutions cannot answer this question accurately. This thesis presents PERP, a mathematical model that minimizes the energy consumption of a storage system with respect to performance and recovery; PERP answers the question by providing the number of active nodes and the recovery bandwidth assigned in each time frame. Third, current distributed file systems such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS) employ a pseudo-random method for replica distribution and a centralized lookup table (block map) to record all replica locations. This lookup table requires a large amount of memory and consumes considerable CPU and network resources on the metadata server. With the booming size of Big Data, the metadata server becomes a scalability and performance bottleneck. While current approaches such as HDFS Federation attempt to extend scalability horizontally by allowing multiple metadata servers, we believe a more promising option is to scale up each metadata server vertically. We propose Deister, a novel block management scheme built on top of a deterministic declustering distribution method, Intersected Shifted Declustering (ISD), so that both replica distribution and location lookup can be achieved without a centralized lookup table.
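    The following sketch illustrates why a deterministic layout makes reverse lookup cheap: when replica locations are a pure function of the block ID, the blocks held by a failed node can be recomputed rather than stored in a centralized block map. The shifted round-robin placement shown here is only an illustrative stand-in, not the actual G-SD or ISD algorithms.

```python
# Minimal sketch of deterministic placement and reverse lookup
# (a simple shifted round-robin layout, NOT the actual G-SD/ISD algorithms).
# Because placement is a pure function of the block ID, no central block map
# is needed to answer either "where is block b?" or "what did node n hold?".

NUM_NODES = 8      # illustrative cluster size
REPLICAS = 3       # replication factor

def place(block_id, num_nodes=NUM_NODES, r=REPLICAS):
    """Forward lookup: the r distinct nodes holding this block."""
    start = block_id % num_nodes
    shift = 1 + (block_id // num_nodes) % (num_nodes // r)  # keeps replicas distinct
    return [(start + i * shift) % num_nodes for i in range(r)]

def reverse_lookup(failed_node, max_block_id, num_nodes=NUM_NODES, r=REPLICAS):
    """Blocks stored on a failed node, recomputed instead of read from a table."""
    return [b for b in range(max_block_id)
            if failed_node in place(b, num_nodes, r)]

if __name__ == "__main__":
    print(place(42))                     # which nodes hold block 42
    print(reverse_lookup(3, 100)[:10])   # first few blocks that lived on node 3
```

    In a real system the reverse lookup would be computed per placement group rather than by scanning every block ID, which is where the grouping in G-SD comes in; the point here is only that the mapping is invertible by recomputation.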

    Information Technologies for the Healthcare Delivery System

    That modern healthcare requires information technology to be efficient and fully effective is evident if one spends any time observing the delivery of institutional health care. Consider the observation of a practitioner of the discipline, David M. Eddy, MD, PhD, voiced in Clinical Decision Making (JAMA 263:1265-75, 1990): "...All confirm what would be expected from common sense: The complexity of modern medicine exceeds the inherent limitations of the unaided human mind." The goal of this thesis is to identify the technological factors required to enable a fully sufficient application of information technology (IT) to the modern institutional practice of medicine. Perhaps the epitome of healthcare IT is the fully integrated, fully electronic patient medical record. Although the Institute of Medicine called in 1991 for such a record to be standard technology by 2001, it has still not materialized. The author will argue that some of the technology and standards that are prerequisite for this achievement have now arrived, while others are still evolving toward fully sufficient levels. The paper concentrates primarily on the healthcare system in the United States, although much of what it contains is applicable, to a large degree, around the world. The paper illustrates certain of these prerequisite IT factors by discussing the actual installation of a major healthcare computer system at the University of Rochester Medical Center (URMC) in Rochester, New York. This system is a Picture Archiving and Communications System (PACS). As the name implies, a PACS captures healthcare images in digital format, stores them, and communicates them to users throughout the enterprise.

    High performance disk array architectures.

    Yeung Kai-hau, Alan. Thesis (Ph.D.), Chinese University of Hong Kong, 1995. Includes bibliographical references. Contents:
    Chapter 1: Introduction (The Information Age; The Importance of Input/Output; Redundant Arrays of Inexpensive Disks; Outline of the Thesis)
    Chapter 2: Selective Broadcast Data Distribution Systems (The Distributed Architecture; Mean Block Acquisition Delay for Uniform Request Distribution; Mean Block Acquisition Delay for General Request Distributions; Optimal Choice of Block Sizes)
    Chapter 3: Dynamic Multiple Parity Disk Arrays (DMP Disk Array; Average Delay; Maximum Throughput; Simulation with Precise Disk Model)
    Chapter 4: Dynamic Parity Logging Disk Arrays (DPL Disk Array Architecture; DPL Disk Array Operation; Performance of DPL Disk Array)
    Chapter 5: Performance Analysis of Mirrored Disk Array (Queueing Model; Delay Analysis; Numerical Examples and Simulation Results)
    Chapter 6: State Reduction in the Exact Analysis of Fork/Join Queues (State Reduction for Closed Fork/Join Queueing Systems; Extension to Open Fork/Join Queueing Systems)
    Chapter 7: Conclusion and Future Research (Summary; Future Research)

    An extensive study on iterative solver resilience : characterization, detection and prediction

    Soft errors caused by transient bit flips have the potential to significantly impact an application's behavior. This has motivated the design of an array of techniques to detect, isolate, and correct soft errors using microarchitectural, architectural, compilation-based, or application-level techniques to minimize their impact on the executing application. The first step toward the design of good error detection/correction techniques involves an understanding of an application's vulnerability to soft errors. This work focuses on the effects of silent data corruption on iterative solvers and on efforts to mitigate those effects. In this thesis, we first present the first comprehensive characterization of the impact of soft errors on the convergence characteristics of six iterative methods using application-level fault injection. We analyze the impact of soft errors in terms of the type of error (single- vs multi-bit), the distribution and location of the bits affected, the data structure and statement impacted, and the variation over time. We create a public-access database with more than 1.5 million fault injection results. We then analyze the performance of soft error detection mechanisms and present comparative results. Motivated by our observations, we evaluate a machine-learning-based detector that takes as features the runtime features observed by the individual detectors in arriving at their conclusions. Our evaluation demonstrates improved results over the individual detectors. We then propose a machine-learning-based method to predict a program's error behavior in order to make fault injection studies more efficient. We demonstrate this method by assessing the performance of soft error detectors. We show that our method maintains 84% accuracy on average at up to 53% lower cost. We also show that, once a model is trained, further fault injection tests would cost 10% of the expected full fault injection runs.
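    As a concrete picture of application-level fault injection, the sketch below flips one bit of one entry of a Jacobi iterate and compares the residual history against a clean run. The solver, the injection point, and the bit position are illustrative choices, not the thesis's fault-injection tooling.

```python
import struct
import numpy as np

# Minimal sketch of application-level fault injection (not the thesis's tool):
# flip a single bit in one entry of the iterate of a Jacobi solver and watch
# how the residual evolves. Iteration, index, and bit position are illustrative.

def flip_bit(value, bit):
    """Flip one bit in the IEEE-754 double representation of `value`."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def jacobi(A, b, iters=200, inject_at=None, idx=0, bit=52):
    x = np.zeros_like(b)
    D = np.diag(A)
    R = A - np.diagflat(D)
    residuals = []
    for k in range(iters):
        x = (b - R @ x) / D
        if k == inject_at:                 # inject a silent data corruption
            x[idx] = flip_bit(x[idx], bit)
        residuals.append(np.linalg.norm(b - A @ x))
    return x, residuals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((50, 50))
    A += 50 * np.eye(50)                   # diagonally dominant, so Jacobi converges
    b = rng.random(50)
    _, clean = jacobi(A, b)
    _, faulty = jacobi(A, b, inject_at=50, idx=3, bit=60)  # high-order bit flip
    print("clean residual after injection point :", clean[51])
    print("faulty residual after injection point:", faulty[51])
```

    Sweeping the injected iteration, the affected index, and the bit position over many runs is what produces the kind of fault-injection database the thesis describes, with each run labeled by its effect on convergence.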

    Evaluation of performance and space utilisation when using snapshots in the ZFS and Hammer file systems

    Modern file systems implement snapshots, or read-only point-in-time representations of the file system. Snapshots can be used to keep a record of the changes made to the data and to improve backups. Previous work has shown that snapshots decrease read and write performance, but it remained an open question how the number of snapshots affects the file system. This thesis studies this question on the ZFS and Hammer file systems. The study is done by running a series of benchmarks and creating snapshots of each file system. The results show that performance decreases significantly on both ZFS and Hammer, and that ZFS becomes unstable after a certain point: there is a steep decrease in performance and an increase in latency and in the variance of the measurements. The performance of ZFS is significantly lower than on Hammer, and its performance decrease is larger. For space utilisation, the results are linear for ZFS up to the point where the system turns unstable. The results are not linear on Hammer, but more work is needed to determine which function they follow.
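    A minimal sketch of the measurement loop described above: write a fixed amount of data, take a ZFS snapshot, and record write throughput and the space held by snapshots as the snapshot count grows. The dataset name, mount point, and sizes are assumptions, and this is not the thesis's benchmark harness; an analogous loop would be needed for Hammer.

```python
import os
import subprocess
import time

# Minimal sketch (not the thesis's harness): overwrite a file, snapshot the
# dataset, and record write speed and snapshot space as snapshots accumulate.
# DATASET and MOUNT are illustrative; run as root against a disposable test pool.

DATASET = "tank/bench"            # assumed test dataset
MOUNT = "/tank/bench"             # assumed mount point
BLOCK = b"x" * (1 << 20)          # 1 MiB write unit
MB_PER_ROUND = 256

def write_round(path, mb):
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())
    return mb / (time.time() - start)          # MiB/s

def snapshot_space(dataset):
    out = subprocess.run(["zfs", "get", "-H", "-p", "-o", "value",
                          "usedbysnapshots", dataset],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())             # bytes held by snapshots

if __name__ == "__main__":
    for i in range(100):                       # grow the snapshot count
        speed = write_round(os.path.join(MOUNT, "data.bin"), MB_PER_ROUND)
        subprocess.run(["zfs", "snapshot", f"{DATASET}@bench{i}"], check=True)
        print(i, f"{speed:.1f} MiB/s", snapshot_space(DATASET), "bytes in snapshots")
```

    Overwriting the same file each round forces the snapshots to retain superseded blocks, so the space column grows with the snapshot count while the throughput column exposes any performance decline of the kind the thesis measures.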