
    Cold Storage Data Archives: More Than Just a Bunch of Tapes

    The abundance of available sensor and derived data from large scientific experiments, such as earth observation programs, radio astronomy sky surveys, and high-energy physics, already exceeds the capacity of storage hardware fabricated globally each year. As a result, cold storage data archives are the often-overlooked spearheads of modern big data analytics in scientific, data-intensive application domains. While high-performance data analytics has received much attention from the research community, the growing number of problems in designing and deploying cold storage archives has received very little attention. In this paper, we take the first step towards bridging this gap in knowledge by presenting an analysis of four real-world cold storage archives from three different application domains. In doing so, we highlight (i) workload characteristics that differentiate these archives from traditional, performance-sensitive data analytics, (ii) design trade-offs involved in building cold storage systems for these archives, and (iii) deployment trade-offs with respect to migration to the public cloud. Based on our analysis, we discuss several other important research challenges that need to be addressed by the data management community.

    Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

    Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments in which to archive data without enough redundancy. Most redundancy schemes are not completely effective at providing high availability, durability, and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage, and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability and durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical. Remarkably, they excel at code locality; hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios. (12 pages, 13 figures; partially supported by Swiss National Science Foundation SNSF Doc.Mobility 162014; published at the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).)
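
    The mechanism sketched in this abstract lends itself to a small illustration. Below is a minimal toy sketch in Python, written for this listing; it is not the authors' implementation and omits the code's other two lattice parameters. Each data block is XORed into alpha running parity chains, so storage overhead grows linearly with alpha while each block gains alpha recovery paths.

    ```python
    import os

    BLOCK = 16  # toy block size in bytes


    def xor(a: bytes, b: bytes) -> bytes:
        """Bytewise XOR of two equal-length blocks."""
        return bytes(x ^ y for x, y in zip(a, b))


    class ToyEntanglement:
        """Simplified entanglement: every data block is folded into `alpha`
        running XOR parity chains. A sketch only, not the paper's full
        construction with horizontal and helical strands."""

        def __init__(self, alpha: int = 2):
            # chains[c][i] is the parity of chain c after i blocks;
            # each chain starts from an all-zero seed parity.
            self.chains = [[bytes(BLOCK)] for _ in range(alpha)]

        def append(self, block: bytes) -> None:
            """Entangle a new data block into every chain."""
            assert len(block) == BLOCK
            for chain in self.chains:
                chain.append(xor(chain[-1], block))

        def recover(self, i: int, chain_id: int = 0) -> bytes:
            """Rebuild data block i from one chain: d_i = p_i XOR p_{i+1}."""
            p = self.chains[chain_id]
            return xor(p[i], p[i + 1])


    # usage: entangle four random blocks, then recover one from either chain
    store = ToyEntanglement(alpha=2)
    blocks = [os.urandom(BLOCK) for _ in range(4)]
    for b in blocks:
        store.append(b)
    assert store.recover(2, chain_id=0) == blocks[2]
    assert store.recover(2, chain_id=1) == blocks[2]
    ```

    In this toy version every chain carries the same information, so the extra recovery paths are merely redundant copies; the actual codes route blocks through distinct horizontal and helical strands, which is what makes the number of recovery paths grow much faster than the storage overhead.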

    A Guide to Distributed Digital Preservation

    This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways.

    Open-Source ANSS Quake Monitoring System Software

    ANSS stands for the Advanced National Seismic System of the U.S.A., and the ANSS Quake Monitoring System (AQMS) is the earthquake management system (EMS) that most of its member regional seismic networks (RSNs) use. AQMS is based on Earthworm, but instead of storing files on disk, it uses a relational database with replication capability to store pick, amplitude, waveform, and event parameters. The replicated database and other features of AQMS make it a fully redundant system. A graphical user interface written in Java, Jiggle, is used to review automatically generated picks and event solutions, relocate events, and recalculate magnitudes. Add-on mechanisms to produce various postearthquake products, such as ShakeMaps and focal mechanisms, are available as well. AQMS also provides a configurable automatic alarming and notification system. The Pacific Northwest Seismic Network, one of the Tier 1 ANSS RSNs, has modified AQMS to be compatible with a freely available, capable, open-source database system, PostgreSQL, and is running this version successfully in production. The AQMS Software Working Group has moved the software from a Subversion repository server hosted at the California Institute of Technology to a public repository at gitlab.com. The drawback of AQMS as a whole is that it is complex to fully configure and comprehend. Nevertheless, the fact that it is very capable, documented, and now free to use might make it an attractive EMS choice for many seismic networks.
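
    As a rough illustration of the database-backed design described above, the following sketch (Python with psycopg2; the table and column names are invented for this example and are not the actual AQMS schema) shows the kind of relational pick storage that replaces Earthworm's on-disk files. Redundancy then comes from standard PostgreSQL replication configured underneath the application.

    ```python
    # Hypothetical sketch: store an automatic phase pick in PostgreSQL,
    # the open-source database the PNSN port of AQMS runs on. The schema
    # and connection parameters below are illustrative only.
    import psycopg2

    conn = psycopg2.connect("dbname=aqms user=rsn host=localhost")

    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute("""
            CREATE TABLE IF NOT EXISTS pick (
                pick_id   BIGSERIAL PRIMARY KEY,
                net       TEXT NOT NULL,         -- network code, e.g. 'UW'
                sta       TEXT NOT NULL,         -- station code
                chan      TEXT NOT NULL,         -- channel code, e.g. 'HHZ'
                phase     TEXT NOT NULL,         -- 'P' or 'S'
                pick_time TIMESTAMPTZ NOT NULL,  -- arrival time of the pick
                quality   REAL                   -- picker-assigned weight
            )
        """)
        cur.execute(
            "INSERT INTO pick (net, sta, chan, phase, pick_time, quality) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            ("UW", "RATT", "HHZ", "P", "2020-01-01T00:00:00Z", 0.9),
        )

    conn.close()
    ```

    Because picks live in ordinary tables rather than flat files, a standby server kept in sync by the database's own replication can take over without application-level copying, which is the redundancy property the abstract highlights.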

    Euclid's US Science Data Center: lessons learned from building a small part of a big system

    Euclid is an ESA M-class mission to study the geometry and nature of the dark universe, slated for launch in mid-2022. NASA is participating in the mission through the contribution of the near-infrared detectors and associated electronics, the nomination of scientists for membership in the Euclid Consortium, and the establishment of the Euclid NASA Science Center at IPAC (ENSCI) to support the US community. As part of ENSCI’s work, we will participate in the Euclid Science Ground Segment (SGS) and build and operate the US Science Data Center (SDC-US), which will be a node in the distributed data processing system for the mission. SDC-US is one of 10 data centers and will contribute about 5% of the computing and data storage for the distributed system. We discuss lessons learned in developing a node in a distributed system. For example, there is a significant advantage to SDC-US development in sharing knowledge, problem solving, and the resource burden with other parts of the system. On the other hand, fitting into a system that is distributed geographically and relies on diverse computing environments adds complexity to constructing SDC-US.

    Purple Computational Environment With Mappings to ACE Requirements for the General Availability User Environment Capabilities


    Storing and manipulating environmental big data with JASMIN

    JASMIN is a super-data-cluster designed to provide a high-performance, high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflows and to curate data products in the STFC Centre for Environmental Data Archival (CEDA). The initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and usage that there was pent-up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute capacity, its storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.

    Preserving Our Collections, Preserving Our Missions

    A Guide to Distributed Digital Preservation is intentionally structured such that every chapter can stand on its own or be paired with other segments of the book at will, allowing readers to pick their own pathway through the guide as best suits their needs. This approach has necessitated that the authors and editors include some level of repetition of basic principles across chapters, and has also made the Glossary (included at the back of this guide) an essential reference resource for all readers. This guide is written with a broad audience in mind that includes librarians, curators, archivists, scholars, technologists, lawyers, and administrators. Any resourceful reader should be able to use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation (DDP), including how to establish or join a Private LOCKSS Network (PLN).