Cold Storage Data Archives: More Than Just a Bunch of Tapes
The abundance of available sensor and derived data from large scientific
experiments, such as earth observation programs, radio astronomy sky surveys,
and high-energy physics already exceeds the storage hardware globally
fabricated per year. As a result, cold storage data archives are the---often
overlooked---spearheads of modern big data analytics in scientific,
data-intensive application domains. While high-performance data analytics has
received much attention from the research community, the growing number of
problems in designing and deploying cold storage archives has received
comparatively little attention.
In this paper, we take the first step towards bridging this gap in knowledge
by presenting an analysis of four real-world cold storage archives from three
different application domains. In doing so, we highlight (i) workload
characteristics that differentiate these archives from traditional,
performance-sensitive data analytics, (ii) design trade-offs involved in
building cold storage systems for these archives, and (iii) deployment
trade-offs with respect to migration to the public cloud. Based on our
analysis, we discuss several other important research challenges that need to
be addressed by the data management community.
Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments
Data centres that use consumer-grade disk drives and distributed
peer-to-peer systems are unreliable environments for archiving data without
sufficient redundancy. Most redundancy schemes are not completely effective at
providing high availability, durability, and integrity in the long term. We propose alpha
entanglement codes, a mechanism that creates a virtual layer of highly
interconnected storage devices to propagate redundant information across a
large scale storage system. Our motivation is to design flexible and practical
erasure codes with high fault-tolerance to improve data durability and
availability even in catastrophic scenarios. By flexible and practical, we mean
code settings that can be adapted to future requirements and practical
implementations with reasonable trade-offs between security, resource usage and
performance. The codes have three parameters. Alpha increases storage overhead
linearly but increases the possible paths to recover data exponentially. Two
other parameters increase fault-tolerance even further without the need for
additional storage. As a result, an entangled storage system can provide high
availability, durability and offer additional integrity: it is more difficult
to modify data undetectably. We evaluate how several redundancy schemes perform
in unreliable environments and show that alpha entanglement codes are flexible
and practical codes. Remarkably, they excel at code locality; hence, they
reduce repair costs and become less dependent on storage locations with poor
availability. Our solution outperforms Reed-Solomon codes in many disaster
recovery scenarios.
Comment: 12 pages, 13 figures. This work was partially supported by Swiss
National Science Foundation SNSF Doc.Mobility 162014. Published in the 2018
48th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN).
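As a rough illustration of the entanglement idea behind this abstract, the simplest (chain) form XORs each data block into a running parity, so any lost block can be rebuilt from its two neighbouring parities; alpha entanglement codes generalize this to multiple interleaved strands. The sketch below is a toy illustration only, and the function names are illustrative, not taken from the paper.

```python
# Toy sketch of chain XOR entanglement (illustrative, not the paper's code).
# Parity p_i = d_1 ^ d_2 ^ ... ^ d_i, with p_0 = all zeros.
# A lost data block d_i is repaired from its neighbouring parities:
# d_i = p_{i-1} ^ p_i.

def entangle(blocks):
    """Return the parity chain [p_0, p_1, ..., p_n] for equal-length blocks."""
    parities = [bytes(len(blocks[0]))]  # p_0 = zeros
    for block in blocks:
        prev = parities[-1]
        parities.append(bytes(x ^ y for x, y in zip(prev, block)))
    return parities

def repair(i, parities):
    """Recover data block d_i (1-indexed) from parities p_{i-1} and p_i."""
    return bytes(x ^ y for x, y in zip(parities[i - 1], parities[i]))

data = [b"AAAA", b"BBBB", b"CCCC"]
ps = entangle(data)
assert repair(2, ps) == b"BBBB"  # second block rebuilt from its neighbours
```

In this toy chain, storing only the parities already lets any single lost block be recomputed from its neighbours; the paper's alpha parameter multiplies the number of such independent recovery paths at a linear cost in storage.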
A Guide to Distributed Digital Preservation
This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways." --P. [4] of cover
Open-Source ANSS Quake Monitoring System Software
ANSS stands for the Advanced National Seismic System of the U.S.A., and the ANSS Quake Monitoring System (AQMS) is the earthquake management system (EMS) that most of its member regional seismic networks (RSNs) use. AQMS is based on Earthworm, but instead of storing files on disk, it uses a relational database with replication capability to store pick, amplitude, waveform, and event parameters. The replicated database and other features of AQMS make it a fully redundant system. A graphical user interface written in Java, Jiggle, is used to review automatically generated picks and event solutions, relocate events, and recalculate magnitudes. Add-on mechanisms to produce various post-earthquake products, such as ShakeMaps and focal mechanisms, are available as well. AQMS also provides a configurable automatic alarming and notification system. The Pacific Northwest Seismic Network, one of the Tier 1 ANSS RSNs, has modified AQMS to be compatible with a freely available, capable, open-source database system, PostgreSQL, and is running this version successfully in production. The AQMS Software Working Group has moved the software from a Subversion repository server hosted at the California Institute of Technology to a public repository at gitlab.com. The drawback of AQMS as a whole is that it is complex to fully configure and comprehend. Nevertheless, the fact that it is very capable, documented, and now free to use might make it an attractive EMS choice for many seismic networks.
Euclid's US Science Data Center: lessons learned from building a small part of a big system
Euclid is an ESA M-class mission to study the geometry and nature of the dark universe, slated for launch in mid-2022. NASA is participating in the mission through the contribution of the near-infrared detectors and associated electronics, the nomination of scientists for membership in the Euclid Consortium, and by establishing the Euclid NASA Science Center at IPAC (ENSCI) to support the US community. As part of ENSCI's work, we will participate in the Euclid Science Ground Segment (SGS) and build and operate the US Science Data Center (SDC-US), which will be a node in the distributed data processing system for the mission. SDC-US is one of 10 data centers and will contribute about 5% of the computing and data storage for the distributed system. We discuss lessons learned in developing a node in a distributed system. For example, there is a significant advantage to SDC-US development in sharing knowledge, problem solving, and resource burden with other parts of the system. On the other hand, fitting into a system that is distributed geographically and relies on diverse computing environments results in added complexity in constructing SDC-US.
Storing and manipulating environmental big data with JASMIN
JASMIN is a super-data-cluster designed to provide
a high-performance high-volume data analysis environment for
the UK environmental science community. Thus far JASMIN
has been used primarily by the atmospheric science and earth
observation communities, both to support their direct scientific workflows and to curate data products in the STFC Centre for Environmental Data Archival (CEDA). Initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and usage that there was pent-up demand for a suitable big-data analysis environment.
This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.
SOAR (Support Office for Aerogeophysical Research) Annual Report 1995/1996
The Support Office for Aerogeophysical Research (SOAR) was a facility of the National Science Foundation's Office of Polar Programs whose mission was to make airborne geophysical observations available to the broad research community of geology, glaciology, and other sciences. The central office of the SOAR facility was located in Austin, Texas, within the University of Texas Institute for Geophysics. Other institutions with significant responsibilities were the Lamont-Doherty Earth Observatory of Columbia University and the Geophysics Branch of the U.S. Geological Survey. This report summarizes the goals and accomplishments of the SOAR facility during 1995/1996 and plans for the next year. (National Science Foundation's Office of Polar Programs; Institute for Geophysics)
Preserving Our Collections, Preserving Our Missions
A Guide to Distributed Digital Preservation is intentionally structured such that every chapter can stand on its own or be paired with other segments of the book at will, allowing readers to pick their own pathway through the guide as best suits their needs. This approach has necessitated that the authors and editors include some level of repetition of basic principles across chapters, and has also made the Glossary (included at the back of this guide) an essential reference resource for all readers. This guide is written with a broad audience in mind that includes librarians, curators, archivists, scholars, technologists, lawyers, and administrators. Any resourceful reader should be able to use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation (DDP), including how to establish or join a Private LOCKSS Network (PLN).