146 research outputs found

    Beyond Single-Deletion Correcting Codes: Substitutions and Transpositions

    Get PDF
    We consider the problem of designing low-redundancy codes in settings where one must correct deletions in conjunction with substitutions or adjacent transpositions; a combination of errors that is usually observed in DNA-based data storage. One of the most basic versions of this problem was settled more than 50 years ago by Levenshtein, who proved that binary Varshamov-Tenengolts codes correct one arbitrary edit error, i.e., one deletion or one substitution, with nearly optimal redundancy. However, this approach fails to extend to many simple and natural variations of the binary single-edit error setting. In this work, we make progress on the code design problem above in three such variations: - We construct linear-time encodable and decodable length-n non-binary codes correcting a single edit error with nearly optimal redundancy log n+O(log log n), providing an alternative simpler proof of a result by Cai, Chee, Gabrys, Kiah, and Nguyen (IEEE Trans. Inf. Theory 2021). This is achieved by employing what we call weighted VT sketches, a new notion that may be of independent interest. - We show the existence of a binary code correcting one deletion or one adjacent transposition with nearly optimal redundancy log n+O(log log n). - We construct linear-time encodable and list-decodable binary codes with list-size 2 for one deletion and one substitution with redundancy 4log n+O(log log n). This matches the existential bound up to an O(log log n) additive term

    Codes for Correcting Asymmetric Adjacent Transpositions and Deletions

    Full text link
    Codes in the Damerau--Levenshtein metric have been extensively studied recently owing to their applications in DNA-based data storage. In particular, Gabrys, Yaakobi, and Milenkovic (2017) designed a length-nn code correcting a single deletion and ss adjacent transpositions with at most (1+2s)logn(1+2s)\log n bits of redundancy. In this work, we consider a new setting where both asymmetric adjacent transpositions (also known as right-shifts or left-shifts) and deletions may occur. We present several constructions of the codes correcting these errors in various cases. In particular, we design a code correcting a single deletion, s+s^+ right-shift, and ss^- left-shift errors with at most (1+s)log(n+s+1)+1(1+s)\log (n+s+1)+1 bits of redundancy where s=s++ss=s^{+}+s^{-}. In addition, we investigate codes correcting tt 00-deletions, s+s^+ right-shift, and ss^- left-shift errors with both uniquely-decoding and list-decoding algorithms. Our main contribution here is the construction of a list-decodable code with list size O(nmin{s+1,t})O(n^{\min\{s+1,t\}}) and with at most (max{t,s+1})logn+O(1)(\max \{t,s+1\}) \log n+O(1) bits of redundancy, where s=s++ss=s^{+}+s^{-}. Finally, we construct both non-systematic and systematic codes for correcting blocks of 00-deletions with \ell-limited-magnitude and ss adjacent transpositions

    t-Deletion-s-Insertion-Burst Correcting Codes

    Full text link
    Motivated by applications in DNA-based storage and communication systems, we study deletion and insertion errors simultaneously in a burst. In particular, we study a type of error named tt-deletion-ss-insertion-burst ((t,s)(t,s)-burst for short) which is a generalization of the (2,1)(2,1)-burst error proposed by Schoeny {\it et. al}. Such an error deletes tt consecutive symbols and inserts an arbitrary sequence of length ss at the same coordinate. We provide a sphere-packing upper bound on the size of binary codes that can correct a (t,s)(t,s)-burst error, showing that the redundancy of such codes is at least logn+t1\log n+t-1. For t2st\geq 2s, an explicit construction of binary (t,s)(t,s)-burst correcting codes with redundancy logn+(ts1)loglogn+O(1)\log n+(t-s-1)\log\log n+O(1) is given. In particular, we construct a binary (3,1)(3,1)-burst correcting code with redundancy at most logn+9\log n+9, which is optimal up to a constant.Comment: Part of this work (the (t,1)-burst model) was presented at ISIT2022. This full version has been submitted to IEEE-IT in August 202

    Malleable coding: compressed palimpsests

    Full text link
    A malleable coding scheme considers not only compression efficiency but also the ease of alteration, thus encouraging some form of recycling of an old compressed version in the formation of a new one. Malleability cost is the difficulty of synchronizing compressed versions, and malleable codes are of particular interest when representing information and modifying the representation are both expensive. We examine the trade-off between compression efficiency and malleability cost under a malleability metric defined with respect to a string edit distance. This problem introduces a metric topology to the compressed domain. We characterize the achievable rates and malleability as the solution of a subgraph isomorphism problem. This can be used to argue that allowing conditional entropy of the edited message given the original message to grow linearly with block length creates an exponential increase in code length.First author draf

    Hidden Addressing Encoding for DNA Storage

    Get PDF
    DNA is a natural storage medium with the advantages of high storage density and long service life compared with traditional media. DNA storage can meet the current storage requirements for massive data. Owing to the limitations of the DNA storage technology, the data need to be converted into short DNA sequences for storage. However, in the process, a large amount of physical redundancy will be generated to index short DNA sequences. To reduce redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using the improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and then, a 10.1 MB file is encoded with the hidden addressing. First, the Dottup dot plot generator and the Jaccard similarity coefficient analyze the overall self-similarity of the encoding sequence index, and then the sequence fragments of GC content are used to verify the performance of this scheme. The final results show that the encoding scheme indexes with overall lower self-similarity, and the local thermodynamic properties of the sequence are better. The hidden addressing encoding scheme proposed can not only improve the utilization of bases but also ensure the correct rate of DNA storage during the sequencing and decoding processes
    corecore