2,863 research outputs found

    Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

    DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters. Comment: 11 pages, 2 figures, LaTeX; version accepted for publicatio

    Duplication-Correcting Codes for Data Storage in the DNA of Living Organisms

    The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we provide error-correcting codes for errors caused by tandem duplications, which create a copy of a block of the sequence and insert it in a tandem manner, i.e., next to the original. In particular, we present two families of codes for correcting errors due to tandem-duplications of a fixed length; the first family can correct any number of errors while the second corrects a bounded number of errors. We also study codes for correcting tandem duplications of length up to a given constant k, where we are primarily focused on the cases of k = 2, 3
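
    The tandem-duplication error described in this abstract can be illustrated with a minimal sketch. The function names `tandem_duplicate` and `duplication_root` are hypothetical and chosen only for illustration; the greedy removal loop is an assumption about one simple way to undo fixed-length duplications, not the construction used in the paper.

```python
def tandem_duplicate(seq: str, start: int, length: int) -> str:
    """Copy the block seq[start:start+length] and insert the copy right after it."""
    if length <= 0 or start < 0 or start + length > len(seq):
        raise ValueError("block must lie inside the sequence")
    block = seq[start:start + length]
    return seq[:start + length] + block + seq[start + length:]


def duplication_root(seq: str, length: int) -> str:
    """Greedily delete one copy of any adjacent repeated block of the given
    fixed length until no such repeat remains (illustrative helper only)."""
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 2 * length + 1):
            if seq[i:i + length] == seq[i + length:i + 2 * length]:
                # Drop the second copy of the repeated block.
                seq = seq[:i + length] + seq[i + 2 * length:]
                changed = True
                break
    return seq


if __name__ == "__main__":
    noisy = tandem_duplicate("ACGT", 1, 2)   # duplicates the block "CG"
    print(noisy)
    print(duplication_root(noisy, 2))
```

    Here a single duplication of length 2 turns ACGT into ACGCGT, and removing the repeated block recovers the original string.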

    Low-redundancy codes for correcting multiple short-duplication and edit errors

    Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most $p$ edits, where a short duplication generates a copy of a substring with length $\leq 3$ and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to $p$ edits (in addition to duplications) at the additional cost of roughly $8p(\log_q n)(1+o(1))$ symbols of redundancy, thus achieving the same asymptotic rate, where $q \ge 4$ is the alphabet size and $p$ is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when $p$ is a constant with respect to the code length. Comment: 21 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at the ISIT2021 and ISIT202

    Codes Correcting All Patterns of Tandem-Duplication Errors of Maximum Length 3

    The set of all $q$-ary strings that do not contain repeated substrings of length $\leq \ell$ forms a code correcting all patterns of tandem-duplication errors of length $\leq \ell$, when $\ell \in \{1, 2, 3\}$. For $\ell \in \{1, 2\}$, this code is also known to be optimal in terms of asymptotic rate. The purpose of this paper is to demonstrate asymptotic optimality for the case $\ell = 3$ as well, and to give the corresponding characterization of the zero-error capacity of the $(\leq 3)$-tandem-duplication channel. This settles the zero-error problem for $(\leq \ell)$-tandem-duplication channels in all cases where duplication roots of strings are unique. Comment: 5 pages (double-column format
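
    As a rough illustration of the code described in this abstract, the sketch below enumerates strings with no repeated substrings of length at most a given bound. It assumes that "repeated substring" means an adjacent (tandem) repeat, i.e. a block immediately followed by an identical copy; the names `has_tandem_repeat` and `repeat_free_strings` are hypothetical. The brute-force enumeration is only practical for very short strings.

```python
from itertools import product


def has_tandem_repeat(s: str, max_len: int) -> bool:
    """Return True if s contains a block of length <= max_len that is
    immediately followed by an identical copy of itself."""
    for k in range(1, max_len + 1):
        for i in range(len(s) - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                return True
    return False


def repeat_free_strings(alphabet: str, n: int, max_len: int) -> list[str]:
    """Brute-force list of all length-n strings over the alphabet that
    contain no tandem repeat of length <= max_len."""
    words = ("".join(t) for t in product(alphabet, repeat=n))
    return [w for w in words if not has_tandem_repeat(w, max_len)]


if __name__ == "__main__":
    # Count such strings over a 4-letter alphabet for a few short lengths.
    for n in range(1, 6):
        print(n, len(repeat_free_strings("ACGT", n, 3)))
```

    Counting these strings for increasing lengths gives a feel for how the code size grows, although the asymptotic rate itself requires the analysis carried out in the paper.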