
    Challenges of Long-Term Digital Archiving: A Survey

    With an ever-increasing volume of digital records and compliance requirements mandated by regulations, electronic record archiving is becoming more and more important in the digital era. The fundamental functionality of digital archiving includes keeping data content intact and providing provable evidence of events that have happened to the data. The main challenges of long-term digital archiving include: 1) authenticity and integrity of data content; 2) viability of information in the face of technology obsolescence; 3) reliable, affordable, sustainable and efficient archival media. All modifications to a digital archiving system should be properly authenticated. Authenticity alone is not enough to protect archived data from human errors or malicious attacks, so various redundancy techniques are used to protect data integrity. Furthermore, it is difficult to correctly interpret data created by legacy hardware/software infrastructure on current computing platforms, as people and organizations use increasingly complex software tools, data models and semantics, and the related formats, standards and semantics evolve quickly. Standard models and formats have been proposed to mitigate the obsolescence problem. For long-term preservation purposes, it is also desirable that the archival media be reliable, affordable, sustainable and efficient. As the capacity of a single magnetic disk keeps growing toward Tera-scale or even Peta-scale with plummeting per-byte cost, magnetic devices become a promising candidate for long-term digital preservation. However, uncorrectable error rates (UER) on the order of 1 bit corruption per 1 Terabyte to 1 bit corruption per 100 Terabytes pose a challenge to the archiving system, as the bit corruption may stay unnoticed for months. We propose several strategies to address this problem: checksumming, replication and efficient auditing.
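    The abstract names checksumming and replication as strategies against silent bit corruption but does not describe a concrete mechanism. As a rough sketch of the general idea (not the authors' actual design), the following Python fragment scrubs an archive against recorded SHA-256 checksums and repairs any silently corrupted object from a replica; the paths, the manifest layout, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large archived objects never fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scrub(primary: Path, replica: Path, manifest: dict[str, str]) -> list[str]:
    """Re-verify every archived object against its checksum recorded at ingest time.

    On a mismatch (silent bit corruption), restore the object from the replica
    if the replica still matches the manifest. Returns the names of objects
    that could not be repaired and need operator attention.
    """
    unrepairable = []
    for name, recorded in manifest.items():
        if sha256_of(primary / name) == recorded:
            continue  # object intact, nothing to do
        if sha256_of(replica / name) == recorded:
            shutil.copy2(replica / name, primary / name)  # repair from the good copy
        else:
            unrepairable.append(name)  # both copies damaged
    return unrepairable
```

    Running such a scrub periodically bounds how long a bit flip can stay unnoticed, which is the concern raised above about corruption lingering for months.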

    Content Sharing Graphs for Deduplication-Enabled Storage Systems

    Deduplication in storage systems has gained momentum recently for its capability to reduce data footprint. However, deduplication introduces challenges to storage management because storage objects (e.g., files) are no longer independent of each other due to content sharing between them. In this paper, we present a graph-based framework to address the storage-management challenges introduced by deduplication. Specifically, we model content sharing among storage objects with content sharing graphs (CSG), and apply graph-based algorithms to two real-world storage management use cases for deduplication-enabled storage systems. First, a quasi-linear algorithm was developed to partition deduplication domains with a minimal amount of deduplication loss (i.e., data replicated across partitioned domains) in commercial deduplication-enabled storage systems, whereas in general the partitioning problem is NP-complete. For a real-world trace of 3 TB of data with 978 GB of removable duplicates, the proposed algorithm partitions the data into 15 balanced partitions with only 54 GB of deduplication loss, that is, about a 5% deduplication loss. Second, a quick and accurate method was developed to query the deduplicated size of a subset of objects in deduplicated storage systems. For the same 3 TB trace, the optimized graph-based algorithm completes the query in 2.6 s, less than 1% of the time taken by the traditional algorithm based on deduplication metadata.
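    The abstract does not give the exact CSG construction or the partitioning and size-query algorithms, so the sketch below only illustrates the underlying idea under a common assumption: each object maps to a set of fingerprinted chunks, two files are connected in the sharing graph if they share a chunk, and the deduplicated size of a subset is the total size of its distinct chunks. The sample data and names are made up.

```python
from collections import defaultdict

# Hypothetical example data: per-file chunk fingerprints and per-chunk sizes,
# as a chunking/fingerprinting pass of a dedup system might produce them.
file_chunks = {
    "a.vmdk": {"c1", "c2", "c3"},
    "b.vmdk": {"c2", "c3", "c4"},
    "c.iso":  {"c5"},
}
chunk_size = {"c1": 4096, "c2": 4096, "c3": 8192, "c4": 4096, "c5": 65536}

def sharing_edges(file_chunks):
    """Content-sharing edges: two files are connected if they share any chunk."""
    owners = defaultdict(set)
    for name, chunks in file_chunks.items():
        for c in chunks:
            owners[c].add(name)
    edges = set()
    for names in owners.values():
        ordered = sorted(names)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                edges.add((ordered[i], ordered[j]))
    return edges

def deduplicated_size(subset):
    """Deduplicated size of a subset = total size of its distinct chunks."""
    distinct = set().union(*(file_chunks[f] for f in subset))
    return sum(chunk_size[c] for c in distinct)

print(sharing_edges(file_chunks))               # {('a.vmdk', 'b.vmdk')}
print(deduplicated_size({"a.vmdk", "b.vmdk"}))  # 4096+4096+8192+4096 = 20480
```

    In this toy model, a partitioning of the files that cuts few sharing edges is one with little deduplication loss, which is the intuition behind partitioning on the CSG rather than on raw deduplication metadata.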

    Efficient Logging and Replication Techniques for Comprehensive Data Protection

    Mariner is an iSCSI-based storage system that is designed to provide comprehensive data protection on commodity ATA disk and Gigabit Ethernet technologies while offering the same performance as systems without any such protection. In particular, Mariner supports continuous data protection (CDP), which allows every disk update within a time window to be undone, and local/remote mirroring to guard data against machine/site failures. To minimize the performance overhead associated with CDP, Mariner employs a modified track-based logging technique that unifies the long-term logging required for CDP with the short-term logging used for low-latency disk writes. This new logging technique strikes an optimal balance among log space utilization, disk write latency, and ease of historical data access. To reduce the performance penalty of the physical data replication used in local/remote mirroring, Mariner features a modified two-phase commit protocol that is in turn built on top of a novel transparent reliable multicast (TRM) mechanism designed specifically for Ethernet-based storage area networks. Without flooding the network, TRM is able to keep the network traffic load of reliable N-way replication at roughly the same level as the no-replication case, regardless of the value of N. Empirical performance measurements on the first Mariner prototype, built from Gigabit Ethernet and ATA disks, show that the average end-to-end latency of a 4 KByte iSCSI write is under 1.2 msec when data logging and replication are both turned on.
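    Mariner's track-based log format and TRM protocol are not described in the abstract. As a toy illustration of the CDP property that every disk update within a retention window is undoable, the following Python sketch appends timestamped writes to a log so the device state can be read or rolled back as of any past instant; the class and its interface are hypothetical, not Mariner's.

```python
import time
from bisect import bisect_right

class CdpLog:
    """Toy continuous-data-protection log for a single block device.

    Every write is appended with a timestamp instead of overwriting in place,
    so the device can be rolled back to any instant inside the retention window.
    """

    def __init__(self):
        self._log = []  # (timestamp, block_number, data), kept in time order

    def write(self, block: int, data: bytes) -> None:
        """Record a new version of `block` at the current time."""
        self._log.append((time.time(), block, data))

    def read_as_of(self, block: int, when: float):
        """Return the newest version of `block` written at or before `when`."""
        latest = None
        for ts, blk, data in self._log:
            if ts > when:
                break
            if blk == block:
                latest = data
        return latest

    def rollback(self, when: float) -> None:
        """Undo every write made after `when` by truncating the log."""
        idx = bisect_right([ts for ts, _, _ in self._log], when)
        del self._log[idx:]
```

    A real CDP implementation additionally has to bound log space and keep reads fast, which is what the track-based logging and its unification with short-term write logging address in Mariner.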