137 research outputs found

    Efficient data reliability management of cloud storage systems for big data applications

    Get PDF
    Cloud service providers are consistently striving to provide efficient and reliable service, to their client's Big Data storage need. Replication is a simple and flexible method to ensure reliability and availability of data. However, it is not an efficient solution for Big Data since it always scales in terabytes and petabytes. Hence erasure coding is gaining traction despite its shortcomings. Deploying erasure coding in cloud storage confronts several challenges like encoding/decoding complexity, load balancing, exponential resource consumption due to data repair and read latency. This thesis has addressed many challenges among them. Even though data durability and availability should not be compromised for any reason, client's requirements on read performance (access latency) may vary with the nature of data and its access pattern behaviour. Access latency is one of the important metrics and latency acceptance range can be recorded in the client's SLA. Several proactive recovery methods, for erasure codes are proposed in this research, to reduce resource consumption due to recovery. Also, a novel cache based solution is proposed to mitigate the access latency issue of erasure coding

    What broke where for distributed and parallel applications — a whodunit story

    Get PDF
    Detection, diagnosis and mitigation of performance problems in today\u27s large-scale distributed and parallel systems is a difficult task. These large distributed and parallel systems are composed of various complex software and hardware components. When the system experiences some performance or correctness problem, developers struggle to understand the root cause of the problem and fix in a timely manner. In my thesis, I address these three components of the performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers. We developed techniques to localize the performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running in supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which uses sophisticated algorithms to first, create a logical progress dependency graph of the tasks to highlight how the problem spread through the system manifesting as a system-wide performance issue. Second, uses this logical progress dependence graph to identify the task where the problem originated. Finally, PRODOMETER pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool-chain that can detect performance anomaly using machine-learning techniques and can achieve very low false positive rate. Our input-aware performance anomaly detection system consists of a scalable data collection framework to collect performance related metrics from different granularity of code regions, an offline model creation and prediction-error characterization technique, and a threshold based anomaly-detection-engine for production runs. Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly detection threshold according to the characteristics of the input data and the characteristics of the prediction-error of the models. Third, we developed performance problem mitigation scheme for erasure-coded distributed storage systems. Repair operations of the failed blocks in erasure-coded distributed storage system take really long time in networked constrained data-centers. The reason being, during the repair operation for erasure-coded distributed storage, a lot of data from multiple nodes are gathered into a single node and then a mathematical operation is performed to reconstruct the missing part. This process severely congests the links toward the destination where newly recreated data is to be hosted. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR) that performs this reconstruction in parallel on multiple nodes and eliminates network bottlenecks, and as a result, greatly speeds up the repair process. Fourth, we study how for a class of applications, performance can be improved (or performance problems can be mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation for the later phases have very small impact on the final quality of the result. Based on this observation, we developed an optimization framework that for a given budget of quality-loss, would find the best approximation settings for each phase in the execution

    Hyfs: design and implementation of a reliable file system

    Get PDF
    Building reliable data storage systems is crucial to any commercial or scientific applications. Modern storage systems are complicated, and they are comprised of many components, from hardware to software. Problems may occur to any component of storage systems and cause data loss. When this kind of failures happens, storage systems cannot continue their data services, which may result in large revenue loss or even catastrophe to enterprises. Therefore, it is critically important to build reliable storage systems to ensure data reliability. In this dissertation, we propose to employ general erasure codes to build a reliable file system, called HyFS. HyFS is a cluster system, which can aggregate distributed storage servers to provide reliable data service. On client side, HyFS is implemented as a native file system so that applications can transparently run on top of HyFS. On server side, HyFS utilizes multiple distributed storage servers to provide highly reliable data service by employing erasure codes. HyFS is able to offer high throughput for either random or sequential file access, which makes HyFS an attractive choice for primary or backup storage systems. This dissertation studies five relevant topics of HyFS. Firstly, it presents several algorithms that can perform encoding operation efficiently for XOR-based erasure codes. Secondly, it discusses an efficient decoding algorithm for RAID-6 erasure codes. This algorithm can recover various types of disk failures. Thirdly, it describes an efficient algorithm to detect and correct errors for the STAR code, which further improves a storage system\u27s reliability. Fourthly, it describes efficient implementations for the arithmetic operations of large finite fields. This is to improve a storage system\u27s security. Lastly and most importantly, it presents the design and implementation of HyFS and evaluates the performance of HyFS

    Internet of Things From Hype to Reality

    Get PDF
    The Internet of Things (IoT) has gained significant mindshare, let alone attention, in academia and the industry especially over the past few years. The reasons behind this interest are the potential capabilities that IoT promises to offer. On the personal level, it paints a picture of a future world where all the things in our ambient environment are connected to the Internet and seamlessly communicate with each other to operate intelligently. The ultimate goal is to enable objects around us to efficiently sense our surroundings, inexpensively communicate, and ultimately create a better environment for us: one where everyday objects act based on what we need and like without explicit instructions

    Managing Smartphone Testbeds with SmartLab

    Get PDF
    The explosive number of smartphones with ever growing sensing and computing capabilities have brought a paradigm shift to many traditional domains of the computing field. Re-programming smartphones and instrumenting them for application testing and data gathering at scale is currently a tedious and time-consuming process that poses significant logistical challenges. In this paper, we make three major contributions: First, we propose a comprehensive architecture, coined SmartLab1, for managing a cluster of both real and virtual smartphones that are either wired to a private cloud or connected over a wireless link. Second, we propose and describe a number of Android management optimizations (e.g., command pipelining, screen-capturing, file management), which can be useful to the community for building similar functionality into their systems. Third, we conduct extensive experiments and microbenchmarks to support our design choices providing qualitative evidence on the expected performance of each module comprising our architecture. This paper also overviews experiences of using SmartLab in a research-oriented setting and also ongoing and future development efforts

    Le code à effacement Mojette : Applications dans les réseaux et dans le Cloud

    Get PDF
    Dans ce travail, je présente l'intérêt du code correcteur à effacement Mojette pour des architectures de stockage distribuées tolérantes aux pannes. De manière générale, l'approche par code permet de réduire d'un facteur 2 le volume de données stockées par rapport à l'approche standard par réplication qui consiste à copier la donnée en autant de fois que l'on suppose de pannes. De manière spécifique, le code à effacement Mojette présente les performances requises pour la lecture et l'écriture de données chaudes i.e très régulièrement sollicitées. Ces performances en entrées/sorties permettent par exemple l'exécution de machines virtuelles sur des données distribuées par le système de fichier RozoFS. En outre, j'effectue un rappel de mes contributions dans le domaine des réseaux auto-organisés de type P2P et ad hoc mobile en présentant respectivement les protocoles P2PWeb et MP-OLSR. L'ensemble de ce travail est le fruit de 5 encadrements doctoraux et de 3 projets collaboratifs majeurs

    Data bases and data base systems related to NASA's aerospace program. A bibliography with indexes

    Get PDF
    This bibliography lists 1778 reports, articles, and other documents introduced into the NASA scientific and technical information system, 1975 through 1980
    corecore