3 research outputs found

    Cold Storage Data Archives: More Than Just a Bunch of Tapes

    The abundance of sensor and derived data from large scientific experiments, such as earth observation programs, radio astronomy sky surveys, and high-energy physics, already exceeds the storage hardware fabricated globally per year. Consequently, cold storage data archives are the often overlooked spearheads of modern big data analytics in scientific, data-intensive application domains. While high-performance data analytics has received much attention from the research community, the growing number of problems in designing and deploying cold storage archives has received very little. In this paper, we take a first step towards bridging this gap by presenting an analysis of four real-world cold storage archives from three different application domains. In doing so, we highlight (i) workload characteristics that differentiate these archives from traditional, performance-sensitive data analytics, (ii) design trade-offs involved in building cold storage systems for these archives, and (iii) deployment trade-offs with respect to migration to the public cloud. Based on our analysis, we discuss several further important research challenges that need to be addressed by the data management community.

    Computing the probability for data loss in two-dimensional parity RAIDs

    Parity RAIDs are used to protect storage systems against disk failures. The idea is to add redundancy to the system by storing the parity of subsets of disks on extra parity disks. A simple two-dimensional scheme arranges the data disks in a rectangular grid and extends every row and column by one disk which stores its parity. In this paper we describe several two-dimensional parity RAIDs and analyse, for each of them, the probability of data loss given that f random disks fail. This probability can be used to determine the overall probability of data loss using the model of Hafner and Rao. We reduce subsets of the forest counting problem to the different cases and show that the generalised problem is #P-hard. Further, we adapt an exact algorithm by Stones for some of the problems; its worst-case runtime is exponential, but it is very efficient for small fixed f and thus sufficient for all real-world applications.
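    As a rough illustration of the failure model analysed here, the sketch below estimates the data-loss probability of such a two-dimensional scheme by Monte Carlo simulation of iterative ("peeling") reconstruction: a failed disk can be rebuilt whenever it is the only failed member of its row or column group. The grid dimensions, disk labels, and trial count are illustrative assumptions; this does not reproduce the paper's exact forest-counting computation.

```python
import random

def two_d_parity_loss_probability(r, c, f, trials=100_000, seed=0):
    """Monte Carlo estimate of P(data loss | f random disk failures) in a
    two-dimensional parity RAID: data disks in an r x c grid, plus one
    parity disk per row and one per column (n = r*c + r + c disks).

    Data loss occurs iff iterative ("peeling") reconstruction gets stuck
    with failed disks remaining. Simulation sketch only, not the exact
    counting approach of the paper.
    """
    rng = random.Random(seed)
    # Disk labels: ('d', i, j) data, ('rp', i) row parity, ('cp', j) column parity.
    disks = ([('d', i, j) for i in range(r) for j in range(c)]
             + [('rp', i) for i in range(r)]
             + [('cp', j) for j in range(c)])
    rows = [{('d', i, j) for j in range(c)} | {('rp', i)} for i in range(r)]
    cols = [{('d', i, j) for i in range(r)} | {('cp', j)} for j in range(c)]
    groups = rows + cols

    losses = 0
    for _ in range(trials):
        failed = set(rng.sample(disks, f))
        progress = True
        while progress and failed:
            progress = False
            for g in groups:
                dead = g & failed
                if len(dead) == 1:      # lone failure in a group: rebuild it
                    failed -= dead
                    progress = True
        if failed:                       # peeling stuck: unrecoverable pattern
            losses += 1
    return losses / trials

# Example: 4x4 data grid with 4 row and 4 column parity disks (24 disks).
# Any 2 failures are recoverable; some 3-failure patterns (a data disk
# together with its row and column parity) already are not.
print(two_d_parity_loss_probability(4, 4, f=3))
```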

    SimFS: A Simulation Data Virtualizing File System Interface

    Nowadays, simulations can produce petabytes of data to be stored in parallel file systems or large-scale databases. This data is accessed, often over the course of decades, by thousands of analysts and scientists. However, storing these volumes of data for long periods of time is not cost-effective and, in some cases, practically impossible. We propose to transparently virtualize the simulation data, relaxing the storage requirements by not storing the full output and re-simulating the missing data on demand. We develop SimFS, a file system interface that exposes a virtualized view of the simulation output to the analysis applications and manages the re-simulations. SimFS monitors the access patterns of the analysis applications in order to (1) decide which data to keep stored for faster access and (2) employ prefetching strategies to reduce the access time for missing data. Virtualizing simulation data allows us to trade storage for computation: the paradigm approaches traditional on-disk analysis (all data is stored) or in situ analysis (no data is stored) according to the storage resources assigned to SimFS. Overall, by exploiting the growing computing power and relaxing the storage capacity requirements, SimFS offers a viable path towards exascale simulations.
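    The core mechanism, serving stored steps directly and regenerating evicted ones by re-running the simulation, can be sketched as below. All names here (the step-file layout, the resimulate_cmd template, the prefetch window) are illustrative assumptions, not SimFS's actual API.

```python
import os
import subprocess

class VirtualizedOutput:
    """Minimal sketch of the SimFS idea: expose the full simulation output
    as if it were stored, re-simulating any missing range on demand."""

    def __init__(self, root, resimulate_cmd, prefetch_window=4):
        self.root = root                      # directory of kept output files
        self.resimulate_cmd = resimulate_cmd  # e.g. "./sim --from {a} --to {b}"
        self.prefetch_window = prefetch_window
        self.last_step = None                 # crude access-pattern tracking

    def _path(self, step):
        return os.path.join(self.root, f"step_{step:06d}.nc")

    def _resimulate(self, first, last):
        # Re-run the simulation to regenerate the missing steps [first, last].
        subprocess.run(self.resimulate_cmd.format(a=first, b=last),
                       shell=True, check=True)

    def open_step(self, step):
        if not os.path.exists(self._path(step)):
            if self.last_step is not None and step == self.last_step + 1:
                # Sequential scans are common in post-hoc analysis, so
                # regenerate a window of upcoming steps in one re-simulation.
                self._resimulate(step, step + self.prefetch_window)
            else:
                self._resimulate(step, step)
        self.last_step = step
        return open(self._path(step), "rb")
```

    The storage/computation trade-off described in the abstract corresponds to how aggressively files under root are evicted: keep everything and this degenerates to on-disk analysis; keep nothing and every access triggers a re-simulation, as in in situ processing.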
