
    Achievable Rates of Concatenated Codes in DNA Storage under Substitution Errors

    In this paper, we study achievable rates of concatenated coding schemes over a deoxyribonucleic acid (DNA) storage channel. Our channel model incorporates the main features of DNA-based data storage. First, information is stored on many short DNA strands. Second, the strands are stored in an unordered fashion inside the storage medium, and each strand is replicated many times. Third, the data is accessed in an uncontrollable manner, i.e., random strands are drawn from the medium and received, possibly with errors. As one of our results, we show that there is a significant gap between the channel capacity and the achievable rate of a standard concatenated code in which one strand corresponds to an inner block. This is surprising because, for other channels such as q-ary symmetric channels, concatenated codes are known to achieve the capacity. We further propose a modified concatenated coding scheme that combines several strands into one inner block, which narrows the gap and achieves rates close to the capacity. Comment: Extended version of a paper submitted to International Symposium on Information Theory and Its Applications (ISITA) 202
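The three channel features listed in this abstract (many short strands, unordered storage with replication, random reads with substitution errors) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the model from the paper: the names `substitute` and `dna_storage_channel`, the uniform draw with replacement (standing in for heavy replication), and the independent per-base substitution probability `p` are all hypothetical choices made for illustration.

```python
import random

ALPHABET = "ACGT"


def substitute(strand, p, rng):
    """Replace each base independently with probability p by a different base."""
    out = []
    for base in strand:
        if rng.random() < p:
            # Assumption: errors are uniform over the three other bases.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def dna_storage_channel(strands, num_reads, p, seed=0):
    """Return num_reads noisy reads drawn at random from the stored strands.

    Assumption: each read is a uniform draw with replacement, so the reader
    learns neither the index of the strand nor how often it was sampled.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        strand = rng.choice(strands)
        reads.append(substitute(strand, p, rng))
    return reads


if __name__ == "__main__":
    stored = ["ACGTACGT", "TTGACCAA", "GGCATTCA"]
    for read in dna_storage_channel(stored, num_reads=6, p=0.1):
        print(read)
```

Because the reads arrive without indices, a decoder has to recover both the content and the ordering of the strands, which is why the abstract treats a strand as the natural unit for an inner block.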

    Achievable Information Rates and Concatenated Codes for the DNA Nanopore Sequencing Channel

    The errors occurring in DNA-based storage are correlated in nature, which is a direct consequence of the synthesis and sequencing processes. In this paper, we consider the memory-k nanopore channel model recently introduced by Hamoum et al., which models the inherent memory of the channel. We derive the maximum a posteriori (MAP) decoder for this channel model. The derived MAP decoder allows us to compute achievable information rates for the true DNA storage channel assuming a mismatched decoder matched to the memory-k nanopore channel model, and to quantify the loss in performance when assuming a small memory length, and hence limited decoding complexity. Furthermore, the derived MAP decoder can be used to design error-correcting codes tailored to the DNA storage channel. We show that a concatenated coding scheme with an outer low-density parity-check code and an inner convolutional code yields excellent performance. Comment: This paper has been accepted and is awaiting publication in the Information Theory Workshop (ITW) 202

    On Coding over Sliced Information

    Interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal up to constants. The surprising result in this paper is that although the information vector is sliced into a set of unordered strings, the number of redundant bits required to correct errors is order-wise equivalent to the number required in the classical error-correcting paradigm.

    On Codes for the Noisy Substring Channel

    We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Because of applications to DNA-based data storage, due to DNA sequencing techniques, interest in this channel has renewed in recent years. In contrast to existing literature, we consider a noisy channel model, where information is subject to noise before its substrings are sampled, motivated by in-vivo storage. We study two separate noise models: substitutions and deletions. In both cases, we examine families of codes which may be utilized for error correction and present combinatorial bounds. Through a generalization of the concept of repeat-free strings, we show that the added required redundancy due to this imperfect observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model in these cases. Comment: ISIT 2021 version (including all proofs)
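To make the observation model in this abstract concrete, the following sketch computes the multiset of fixed-length substrings of a string. It covers only the noiseless observation step; the function name `substring_multiset` and the choice of a single fixed window length are assumptions for illustration, and the noise models studied in the paper are not implemented here.

```python
from collections import Counter


def substring_multiset(s, length):
    """Return the multiset (as a Counter) of all length-`length` substrings of s."""
    if length < 1 or length > len(s):
        raise ValueError("length must be between 1 and len(s)")
    return Counter(s[i:i + length] for i in range(len(s) - length + 1))


if __name__ == "__main__":
    # Two different strings that share the same multiset of 2-substrings,
    # so the channel output alone cannot tell them apart.
    x, y = "ACAGA", "AGACA"
    print(substring_multiset(x, 2) == substring_multiset(y, 2))
```

The two example strings contain a repeated symbol at the overlap length in question, which is the kind of ambiguity that restricting to repeat-free strings is meant to rule out.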

    DNA-based data storage system

    Despite the many advances in traditional data recording techniques, the surge of Big Data platforms and energy conservation issues have imposed new challenges on the storage community in terms of identifying extremely high volume, non-volatile and durable recording media. The potential for using macromolecules for ultra-dense storage was recognized as early as 1959 when Richard Feynman outlined his vision for nanotechnology in a lecture, “There is plenty of room at the bottom”. Among known macromolecules, DNA is unique insofar as it lends itself to implementations of non-volatile recording media of outstanding integrity and extremely high storage capacity. The basic system implementation steps for DNA-based data storage systems include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. In this work we advance the field of macromolecular data storage in three directions. First, we introduce the notion of weakly mutually uncorrelated (WMU) sequences. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. For this purpose, WMU sequences used for primer design in DNA-based data storage systems are also required to be at large mutual Hamming distance from each other, have balanced compositions of symbols, and avoid primer-dimer byproducts. We derive bounds on the size of WMU and various constrained WMU codes and present a number of constructions for balanced, error-correcting, primer-dimer free WMU codes using Dyck paths, prefix-synchronized and cyclic codes.
Second, we describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on the newly developed WMU coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh density archival and rewritable storage applications. Third, we demonstrate for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. Every solution for DNA-based data storage systems so far has exclusively focused on Illumina sequencing devices, but such sequencers are expensive and designed for laboratory use only. Instead, we propose using a new technology, the MinION, Oxford Nanopore’s handheld sequencer. Nanopore sequencing is fast and cheap, but it results in reads with high error rates. To deal with this issue, we designed an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. As a proof of concept, we stored and sequenced around 3.6 kB of binary data that includes two compressed images (a Citizen Kane poster and a smiley face emoji), using a portable data storage system, and obtained error-free read-outs.
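The WMU property stated in this abstract (no sufficiently long suffix of one sequence is the prefix of the same or another sequence) can be checked directly on a small set of candidate primers. The sketch below is an illustration under assumptions, not the construction from the work: the name `is_wmu` and the threshold parameter `min_len` (what "sufficiently long" means) are hypothetical, only proper overlaps shorter than the full sequence are tested, and the additional constraints mentioned in the abstract (Hamming distance, balance, primer-dimer avoidance) are not checked.

```python
def is_wmu(seqs, min_len):
    """Check that no proper suffix of length >= min_len of any sequence
    equals a prefix of the same or another sequence in seqs."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for a in seqs:
        for b in seqs:
            # Only proper overlaps: lengths min_len .. min(len(a), len(b)) - 1.
            for overlap in range(min_len, min(len(a), len(b))):
                if a[-overlap:] == b[:overlap]:
                    return False
    return True


if __name__ == "__main__":
    # "ACGT" ends with "GT" and "GTAC" starts with "GT": overlap of length 2.
    print(is_wmu(["ACGT", "GTAC"], min_len=2))
    # "AACC" has no suffix that is also one of its own prefixes.
    print(is_wmu(["AACC"], min_len=1))
```

A brute-force check like this is quadratic in the number of sequences and is only meant to make the definition tangible for short address sets.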

    Emerging Approaches to DNA Data Storage: Challenges and Prospects

    With the total amount of worldwide data skyrocketing, the global data storage demand is predicted to grow to 1.75 × 10^14 GB by 2025. Traditional storage methods have difficulties keeping pace given that current storage media have a maximum density of 10^3 GB/mm^3. As such, data production will far exceed the capacity of currently available storage methods. The costs of maintaining and transferring data, as well as the limited lifespans and significant data losses associated with current technologies, also demand advanced solutions for information storage. Nature offers a powerful alternative through the storage of information that defines living organisms in unique orders of four bases (A, T, C, G) located in molecules called deoxyribonucleic acid (DNA). DNA molecules as information carriers have many advantages over traditional storage media. Their high storage density, potentially low maintenance cost, ease of synthesis, and chemical modification make them an ideal alternative for information storage. To this end, rapid progress has been made over the past decade by exploiting user-defined DNA materials to encode information. In this review, we discuss the most recent advances of DNA-based data storage with a major focus on the challenges that remain in this promising field, including the current intrinsic low speed in data writing and reading and the high cost per byte stored. Alternatively, data storage relying on DNA nanostructures (as opposed to DNA sequence), as well as on other combinations of nanomaterials and biomolecules, is proposed with promising technological and economic advantages. In summarizing the advances that have been made and underlining the challenges that remain, we provide a roadmap for the ongoing research in this rapidly growing field, which will enable the development of technological solutions to the global demand for superior storage methodologies.