Cold Storage Data Archives: More Than Just a Bunch of Tapes
The abundance of available sensor and derived data from large scientific
experiments, such as earth observation programs, radio astronomy sky surveys,
and high-energy physics, already exceeds the capacity of the storage hardware
fabricated globally per year. As a result, cold storage data archives are the
often overlooked spearheads of modern big data analytics in scientific,
data-intensive application domains. While high-performance data analytics has
received much attention from the research community, the growing number of
problems in designing and deploying cold storage archives has received very
little attention.
In this paper, we take the first step towards bridging this gap in knowledge
by presenting an analysis of four real-world cold storage archives from three
different application domains. In doing so, we highlight (i) workload
characteristics that differentiate these archives from traditional,
performance-sensitive data analytics, (ii) design trade-offs involved in
building cold storage systems for these archives, and (iii) deployment
trade-offs with respect to migration to the public cloud. Based on our
analysis, we discuss several other important research challenges that need to
be addressed by the data management community.
Computing the Probability for Data Loss in Two-Dimensional Parity RAIDs
Parity RAIDs are used to protect storage systems against disk failures. The idea is to add redundancy to the system by storing the parity of subsets of disks on extra parity disks. A simple two-dimensional scheme is one in which the data disks are arranged in a rectangular grid, and every row and column
is extended by one disk that stores its parity. In this paper we describe several two-dimensional parity RAIDs and analyse, for each of them, the probability of data loss given that f random disks fail. This probability can be used to determine the overall probability using the model of Hafner and Rao. We reduce subsets of the forest counting problem to the different cases and show that the generalised problem is #P-hard. Further, we adapt an exact algorithm by Stones for some of the problems; its worst-case runtime is exponential, but
it is very efficient for small fixed f and thus sufficient for all real-world applications.
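To make the row/column scheme concrete, here is a minimal brute-force sketch in Python (an illustration, not the paper's algorithm, which avoids this enumeration): it computes the exact probability of data loss given f failed disks by enumerating all failure patterns and checking recoverability with iterative parity "peeling". The grid size and disk labels are assumptions for illustration.

```python
from itertools import combinations

def groups(rows, cols):
    """Parity groups of the simple 2D scheme: every row and every column of
    data disks plus its dedicated parity disk. Disks are labelled
    ('d', i, j) for data, ('rp', i) / ('cp', j) for row/column parity."""
    gs = []
    for i in range(rows):
        gs.append({('d', i, j) for j in range(cols)} | {('rp', i)})
    for j in range(cols):
        gs.append({('d', i, j) for i in range(rows)} | {('cp', j)})
    return gs

def loses_data(failed, gs):
    """Iteratively 'peel': a failed disk can be rebuilt once it is the only
    missing member of some parity group. Data is lost if peeling stalls."""
    failed = set(failed)
    changed = True
    while failed and changed:
        changed = False
        for disk in list(failed):
            if any(disk in g and not ((g & failed) - {disk}) for g in gs):
                failed.remove(disk)  # rebuild from the surviving group
                changed = True
    return bool(failed)

def p_loss(rows, cols, f):
    """Exact P(data loss | f random disks fail), by enumerating every
    failure pattern: exponential in general, fine for small fixed f."""
    gs = groups(rows, cols)
    disks = sorted(set().union(*gs))
    total = bad = 0
    for failure in combinations(disks, f):
        total += 1
        bad += loses_data(failure, gs)
    return bad / total

if __name__ == '__main__':
    # A 4x4 data grid (16 data + 8 parity disks): any 2 failures are
    # recoverable, while some 3-failure patterns (e.g. a data disk together
    # with both its row and column parity disks) lose data.
    print(p_loss(4, 4, 3))
```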
SimFS: A Simulation Data Virtualizing File System Interface
Nowadays simulations can produce petabytes of data to be stored in parallel
filesystems or large-scale databases. This data is accessed over the course of
decades, often by thousands of analysts and scientists. However, storing these
volumes of data for long periods of time is not cost effective and, in some
cases, practically impossible. We propose to transparently virtualize the
simulation data, relaxing the storage requirements by not storing the full
output and re-simulating the missing data on demand. We develop SimFS, a file
system interface that exposes a virtualized view of the simulation output to
the analysis applications and manages the re-simulations. SimFS monitors the
access patterns of the analysis applications in order to (1) decide which data
to keep stored for faster accesses and (2) employ prefetching strategies to
reduce the access time of missing data. Virtualizing simulation data allows us
to trade storage for computation: depending on the storage resources assigned
to SimFS, this paradigm approaches either traditional on-disk analysis (all
data is stored) or in situ analysis (no data is stored). Overall, by exploiting
the growing computing power and relaxing the storage capacity requirements,
SimFS offers a viable path towards exascale simulations.
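To illustrate the virtualization idea, here is a conceptual Python sketch (hypothetical names throughout; not the actual SimFS implementation or API): reads of missing output files trigger a re-simulation of the restart interval that contains the requested step, and a bounded LRU cache decides which outputs stay on disk.

```python
import os
from collections import OrderedDict

class VirtualSimStore:
    """Hypothetical sketch of a virtualized simulation-output store."""

    def __init__(self, simulate, restart_every, capacity, out_dir='out'):
        self.simulate = simulate             # user callback: re-runs steps [start, stop)
        self.restart_every = restart_every   # spacing of saved restart points
        self.capacity = capacity             # max number of step files kept on disk
        self.out_dir = out_dir
        self.cache = OrderedDict()           # step -> path, in LRU order

    def _path(self, step):
        return os.path.join(self.out_dir, f'step_{step}.dat')

    def read(self, step):
        """Return the output file for `step`, re-simulating it if missing."""
        if step in self.cache:
            self.cache.move_to_end(step)     # record the access for eviction
            return self.cache[step]
        # Re-simulate from the nearest earlier restart point; materializing
        # the whole interval acts as prefetching for neighbouring reads.
        start = (step // self.restart_every) * self.restart_every
        self.simulate(start, start + self.restart_every, self.out_dir)
        for s in range(start, start + self.restart_every):
            if s != step:
                self._admit(s)
        self._admit(step)                    # admit last so it is not evicted
        return self.cache[step]

    def _admit(self, step):
        self.cache[step] = self._path(step)
        self.cache.move_to_end(step)
        while len(self.cache) > self.capacity:
            _, path = self.cache.popitem(last=False)  # evict the LRU step
            if os.path.exists(path):
                os.remove(path)              # storage freed: the data is virtual
```

With a large capacity the store behaves like on-disk analysis; with a very small one it approaches in situ analysis, re-simulating on nearly every access.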